Introduction

When you have large amounts of data, you will need to organize it in a way that makes sense. These ballots from an election are rolled together with similar ballots to keep them organized. (credit: William Greeson)


Note: 
Chapter Objectives 
By the end of this chapter, the student should be able to: 


• Display data graphically and interpret graphs: stemplots, histograms, and box plots.
• Recognize, describe, and calculate the measures of location of data: quartiles and percentiles.
• Recognize, describe, and calculate the measures of the center of data: mean, median, and mode.
• Recognize, describe, and calculate the measures of the spread of data: variance, standard deviation, and range.


Once you have collected data, what will you do with it? Data can be 
described and presented in many different formats. For example, suppose 
you are interested in buying a house in a particular area. You may have no 
clue about the house prices, so you might ask your real estate agent to give 
you a sample data set of prices. Looking at all the prices in the sample often 
is overwhelming. A better way might be to look at the median price and the 
variation of prices. The median and variation are just two ways that you 
will learn to describe data. Your agent might also provide you with a graph 
of the data. 


In this chapter, you will study numerical and graphical ways to describe and 
display your data. This area of statistics is called "Descriptive Statistics." 
You will learn how to calculate, and even more importantly, how to 
interpret these measurements and graphs. 


A statistical graph is a tool that helps you learn about the shape or
distribution of a sample or a population. A graph can be a more effective 
way of presenting data than a mass of numbers because we can see where 
data clusters and where there are only a few data values. Newspapers and 
the Internet use graphs to show trends and to enable readers to compare 
facts and figures quickly. Statisticians often graph data first to get a picture 
of the data. Then, more formal tools may be applied. 


Some of the types of graphs that are used to summarize and organize data 
are the dot plot, the bar graph, the histogram, the stem-and-leaf plot, the 
frequency polygon (a type of broken line graph), the pie chart, and the box 
plot. In this chapter, we will briefly look at stem-and-leaf plots, line graphs, 
and bar graphs, as well as frequency polygons, and time series graphs. Our 
emphasis will be on histograms and box plots. 


Note: 

NOTE 

This book contains instructions for constructing a histogram and a box plot 
for the TI-83+ and TI-84 calculators. The Texas Instruments (TI) website 
provides additional instructions for using these calculators. 


Stem-and-Leaf Graphs (Stemplots), Line Graphs, and Bar Graphs 


One simple graph, the stem-and-leaf graph or stemplot, comes from the field of exploratory data 
analysis. It is a good choice when the data sets are small. To create the plot, divide each 
observation of data into a stem and a leaf. The leaf consists of a final significant digit. For 
example, 23 has stem two and leaf three. The number 432 has stem 43 and leaf two. Likewise, the 
number 5,432 has stem 543 and leaf two. The decimal 9.3 has stem nine and leaf three. Write the 
stems in a vertical line from smallest to largest. Draw a vertical line to the right of the stems. Then 
write the leaves in increasing order next to their corresponding stem. 


Example: 

For Susan Dean's spring pre-calculus class, scores for the first exam were as follows (smallest to 
largest): 

33; 42; 49; 49; 53; 55; 55; 61; 63; 67; 68; 68; 69; 69; 72; 73; 74; 78; 80; 83; 88; 88; 88; 90; 92; 94;
94; 94; 94; 96; 100


Stem Leaf
3 3
4 299
5 355
6 1378899
7 2348
8 03888
9 0244446
10 0


Stem-and-Leaf Graph 


The stemplot shows that most scores fell in the 60s, 70s, 80s, and 90s. Eight out of the 31 scores,
or approximately 26% (8/31), were in the 90s or 100, a fairly high number of As.
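
The stem-and-leaf construction described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `stemplot` is our own, and it assumes non-negative whole-number data, with the final digit as the leaf. The scores are the 31 exam scores from this example.

```python
from collections import defaultdict

def stemplot(values):
    """Return (stem, leaves) rows for a list of non-negative integers.

    The leaf is the final digit; the stem is everything before it.
    Stems with no leaves are still listed so that gaps stay visible.
    """
    leaves_by_stem = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        leaves_by_stem[stem].append(str(leaf))
    stems = range(min(leaves_by_stem), max(leaves_by_stem) + 1)
    return [(stem, "".join(leaves_by_stem.get(stem, []))) for stem in stems]

# Exam scores from the example above (31 values).
scores = [33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 69, 72, 73,
          74, 78, 80, 83, 88, 88, 88, 90, 92, 94, 94, 94, 94, 96, 100]

rows = stemplot(scores)
for stem, leaves in rows:
    print(f"{stem:>2} | {leaves}")

# Share of scores in the 90s or at 100 (eight out of 31).
share_high = sum(1 for s in scores if s >= 90) / len(scores)
```

Listing every stem between the smallest and the largest, even an empty one, is a deliberate choice here: it keeps gaps, and therefore possible outliers, visible in the plot.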


Note: 


Try It 
Exercise: 


Problem: 


For the Park City basketball team, scores for the last 30 games were as follows (smallest to 
largest): 

32; 32; 33; 34; 38; 40; 42; 42; 43; 44; 46; 47; 47; 48; 48; 48; 49; 50; 50; 51; 52; 52; 52; 53; 54;
56; 57; 57; 60; 61

Construct a stem plot for the data. 


Solution: 
Stem Leaf 
3 22348 
4 022346778889 
5 00122234677
6 01 


The stemplot is a quick way to graph data and gives an exact picture of the data. You want to look 
for an overall pattern and any outliers. An outlier is an observation of data that does not fit the rest 
of the data. It is sometimes called an extreme value. When you graph an outlier, it will appear not 
to fit the pattern of the graph. Some outliers are due to mistakes (for example, writing down 50 
instead of 500) while others may indicate that something unusual is happening. It takes some 
background information to explain outliers, so we will cover them in more detail later. 


Example: 

The data are the distances (in kilometers) from a home to local supermarkets. Create a stemplot 
using the data: 

1.1; 1.5; 2.3; 2.5; 2.7; 3.2; 3.3; 3.3; 3.5; 3.8; 4.0; 4.2; 4.5; 4.5; 4.7; 4.8; 5.5; 5.6; 6.5; 6.7; 12.3
Exercise: 


Problem: Do the data seem to have any concentration of values? 


Note: 
NOTE 
The leaves are to the right of the decimal. 


Solution: 


The value 12.3 may be an outlier. Values appear to concentrate at three and four kilometers. 


Stem Leaf 
1 15 
2 357 
3 23358 
4 025578 
5 56 
6 57 
7 
8 
9
10
11
12 3
Note: 
Try It 


Exercise: 


Problem: 


The following data show the distances (in miles) from the homes of off-campus statistics 
students to the college. Create a stem plot using the data and identify any outliers: 


0.5; 0.7; 1.1; 1.2; 1.2; 1.3; 1.3; 1.5; 1.5; 1.7; 1.7; 1.8; 1.9; 2.0; 2.2; 2.5; 2.6; 2.8; 2.8; 2.8; 3.5;
3.8; 4.4; 4.8; 4.9; 5.2; 5.5; 5.7; 5.8; 8.0


Solution: 
Stem Leaf 
0 57 
1 12233557789
2 0256888
3 58
4 489
5 2578
6 
7 
8 0 


The value 8.0 may be an outlier. Values appear to concentrate at one and two miles. 


Example: 
Exercise: 


Problem: 


A side-by-side stem-and-leaf plot allows a comparison of the two data sets in two columns. 
In a side-by-side stem-and-leaf plot, two sets of leaves share the same stem. The leaves are to 
the left and the right of the stems. [link] and [link] show the ages of presidents at their 
inauguration and at their death. Construct a side-by-side stem-and-leaf plot using this data. 


Solution: 


Ages at Inauguration | Stem | Ages at Death
998777632 | 4 | 69
8777766655554444422111110 | 5 | 366778
9854421110 | 6 | 003344567778
 | 7 | 0011147889
 | 8 | 01358
 | 9 | 0033


President        Age    President      Age    President      Age
Washington       57     Lincoln        52     Hoover         54
J. Adams         61     A. Johnson     56     F. Roosevelt   51
Jefferson        57     Grant          46     Truman         60
Madison          57     Hayes          54     Eisenhower     62
Monroe           58     Garfield       49     Kennedy        43
J. Q. Adams      57     Arthur         51     L. Johnson     55
Jackson          61     Cleveland      47     Nixon          56
Van Buren        54     B. Harrison    55     Ford           61
W. H. Harrison   68     Cleveland      55     Carter         52
Tyler            51     McKinley       54     Reagan         69
Polk             49     T. Roosevelt   42     G.H.W. Bush    64
Taylor           64     Taft           51     Clinton        47
Fillmore         50     Wilson         56     G. W. Bush     54
Pierce           48     Harding        55     Obama          47
Buchanan         65     Coolidge       51

Presidential Ages at Inauguration


President        Age    President      Age    President      Age
Washington       67     Lincoln        56     Hoover         90
J. Adams         90     A. Johnson     66     F. Roosevelt   63
Jefferson        83     Grant          63     Truman         88
Madison          85     Hayes          70     Eisenhower     78
Monroe           73     Garfield       49     Kennedy        46
J. Q. Adams      80     Arthur         56     L. Johnson     64
Jackson          78     Cleveland      71     Nixon          81
Van Buren        79     B. Harrison    67     Ford           93
W. H. Harrison   68     Cleveland      71     Reagan         93
Tyler            71     McKinley       58
Polk             53     T. Roosevelt   60
Taylor           65     Taft           72
Fillmore         74     Wilson         67
Pierce           64     Harding        57
Buchanan         77     Coolidge       60

Presidential Age at Death
Note: 
Exercise: 


Problem: 


The table shows the number of wins and losses the Atlanta Hawks have had in 42 seasons. 
Create a side-by-side stem-and-leaf plot of these wins and losses. 


Losses   Wins   Year          Losses   Wins   Year
34       48     1968-1969     41       41     1989-1990
34       48     1969-1970     39       43     1990-1991
46       36     1970-1971     44       38     1991-1992
46       36     1971-1972     39       43     1992-1993
36       46     1972-1973     25       57     1993-1994
47       35     1973-1974     40       42     1994-1995
51       31     1974-1975     36       46     1995-1996
53       29     1975-1976     26       56     1996-1997
51       31     1976-1977     32       50     1997-1998
41       41     1977-1978     19       31     1998-1999
36       46     1978-1979     54       28     1999-2000
32       50     1979-1980     57       25     2000-2001
51       31     1980-1981     49       33     2001-2002
40       42     1981-1982     47       35     2002-2003
39       43     1982-1983     54       28     2003-2004
42       40     1983-1984     69       13     2004-2005
48       34     1984-1985     56       26     2005-2006
32       50     1985-1986     52       30     2006-2007
25       57     1986-1987     45       37     2007-2008
32       50     1987-1988     35       47     2008-2009
30       52     1988-1989     29       53     2009-2010

Atlanta Hawks Wins and Losses


Solution:

Number of Wins | Stem | Number of Losses
3 | 1 | 9
98865 | 2 | 5569
8766554311110 | 3 | 02222445666999
88766633322110 | 4 | 0011245667789
776320000 | 5 | 111234467
 | 6 | 9


Another type of graph that is useful for specific data values is a line graph. In the particular line 
graph shown in [link], the x-axis (horizontal axis) consists of data values and the y-axis (vertical 
axis) consists of frequency points. The frequency points are connected using line segments. 


Example: 
In a survey, 40 mothers were asked how many times per week a teenager must be reminded to do 
his or her chores. The results are shown in [link] and in [link]. 


Number of times teenager is reminded   Frequency
0                                      2
1                                      5
2                                      8
3                                      14
4                                      7
5                                      4

[Figure: line graph with the number of times a teenager is reminded on the x-axis and frequency on the y-axis]


Note: 
Try It 
Exercise: 


Problem: 


In a survey, 40 people were asked how many times per year they had their car in the shop for
repairs. The results are shown in [link]. Construct a line graph.


Number of times in shop Frequency 
0 is 
1 10 
2 14 
3 3 
Solution: 
[Figure: line graph with the number of times in shop on the x-axis and frequency on the y-axis]


Bar graphs consist of bars that are separated from each other. The bars can be rectangles or they 
can be rectangular boxes (used in three-dimensional plots), and they can be vertical or horizontal. 
The bar graph shown in [link] has age groups represented on the x-axis and proportions on the
y-axis.


Example: 
Exercise: 


Problem: 


By the end of 2011, Facebook had over 146 million users in the United States. [link] shows 
three age groups, the number of users in each age group, and the proportion (%) of users in 


each age group. Construct a bar graph using this data. 


Age groups   Number of Facebook users   Proportion (%) of Facebook users
13-25        65,082,280                 45%
26-44        53,300,200                 36%
45-64        27,885,100                 19%


Solution:
[Figure: bar graph with the age groups 13-25, 26-44, and 45-64 on the x-axis and proportion (%) on the y-axis]


Note:
Try It
Exercise:


Problem:



The population in Park City is made up of children, working-age adults, and retirees. [link] 
shows the three age groups, the number of people in the town from each age group, and the 
proportion (%) of people in each age group. Construct a bar graph showing the proportions. 


Age groups Number of people Proportion of population 
Children 67,059 19% 
Working-age adults 152,198 43% 


Retirees 131,662 38% 


Solution: 
[Figure: bar graph with the age groups (children, working-age adults, retirees) on the x-axis and proportion (%), from 0% to 50%, on the y-axis]
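
The proportions plotted in a bar graph like this one come directly from the counts. As a rough sketch (the variable names and the text-only "bars" are our own illustrative choices, not part of the exercise), the Park City percentages can be recomputed from the table and drawn as rows of characters:

```python
# Counts from the Park City table above.
counts = {"Children": 67_059, "Working-age adults": 152_198, "Retirees": 131_662}

total = sum(counts.values())
percents = {group: round(100 * n / total) for group, n in counts.items()}

# Text-only bar graph: one '#' per percentage point (an illustrative choice).
for group, pct in percents.items():
    print(f"{group:<20} {'#' * pct} {pct}%")
```

Because the bars are separated and each one stands for a category, their order along the axis carries no numerical meaning; only their lengths are compared.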


Example: 
Exercise: 


Problem: 


The columns in [link] contain: the race or ethnicity of students in U.S. Public Schools for the
class of 2011, percentages for the Advanced Placement examinee population for that class, and
percentages for the overall student population. Create a bar graph with the student race or
ethnicity (qualitative data) on the x-axis, and the Advanced Placement examinee population
percentages on the y-axis.


Race/Ethnicity                                   AP Examinee Population   Overall Student Population
1 = Asian, Asian American or Pacific Islander    10.3%                    5.7%
2 = Black or African American                    9.0%                     14.7%
3 = Hispanic or Latino                           17.0%                    17.6%
4 = American Indian or Alaska Native             0.6%                     1.1%
5 = White                                        57.1%                    59.2%
6 = Not reported/other                           6.0%                     1.7%

Solution:

[Figure: bar graph with race/ethnicity (categories 1 through 6) on the x-axis and AP examinee population percentage on the y-axis]


Note: 
Try It 
Exercise: 


Problem: 
Park City is broken down into six voting districts. The table shows the percent of the total
registered voter population that lives in each district as well as the percent of the entire
population that lives in each district. Construct a bar graph that shows the registered voter
population by district.


District   Registered voter population   Overall city population
1          15.5%                         19.4%
2          12.2%                         15.6%
3          9.8%                          9.0%
4          17.4%                         18.5%
5          22.8%                         20.7%
6          22.3%                         16.8%

Solution:
[Figure: bar graph with district on the x-axis and registered voter population (%), from 0.0% to 25.0%, on the y-axis]
Chapter Review


A stem-and-leaf plot is a way to plot data and look at the distribution. In a stem-and-leaf plot, all data values within a class are visible. The advantage of a stem-and-leaf plot is that all values are listed, unlike a histogram, which gives classes of data values. A line graph is often used to represent a set of data values in which a quantity varies with time. These graphs are useful for finding trends, that is, finding a general pattern in data sets including temperature, sales, employment, company profit, or cost over a period of time. A bar graph is a chart that uses either horizontal or vertical bars to show comparisons among categories. One axis of the chart shows the specific categories being compared, and the other axis represents a discrete value. Some bar graphs present bars clustered in groups of more than one (grouped bar graphs), and others show the bars divided into subparts to show cumulative effect (stacked bar graphs). Bar graphs are especially useful when categorical data is being used.


For each of the following data sets, create a stem plot and identify any outliers. 
Exercise: 


Problem: 
The miles per gallon rating for 30 cars are shown below (lowest to highest). 


19, 19, 19, 20, 21, 21, 25, 25, 25, 26, 26, 28, 29, 31, 31, 32, 32, 33, 34, 35, 36, 37, 37, 38, 38, 
38, 38, 41, 43, 43 


Solution: 
Stem Leaf 
1 999 
2 0115556689 
3 11223456778888 
4 133

Exercise: 
Problem: 


The height in feet of 25 trees is shown below (lowest to highest). 
25, 27, 33, 34, 34, 34, 35, 37, 37, 38, 39, 39, 39, 40, 41, 45, 46, 47, 49, 50, 50, 53, 53, 54, 54 


Exercise: 
Problem: 
The data are the prices of different laptops at an electronics store. Round each value to the 
nearest ten. 


249, 249, 260, 265, 265, 280, 299, 299, 309, 319, 325, 326, 350, 350, 350, 365, 369, 389, 409, 
459, 489, 559, 569, 570, 610 


Solution: 


Stem Leaf 
2 556778 
3 001233555779 
4 169 
5 677 
6 1

Exercise: 

Problem: 


The data are daily high temperatures in a town for one month. 
61, 61, 62, 64, 66, 67, 67, 67, 68, 69, 70, 70, 70, 71, 71, 72, 74, 74, 74, 75, 75, 75, 76, 76, 77,
78, 78, 79, 79, 95


For the next three exercises, use the data to construct a line graph. 
Exercise: 


Problem: 


In a survey, 40 people were asked how many times they visited a store before making a major 
purchase. The results are shown in [link]. 


Number of times in store   Frequency
1                          4
2                          10
3                          16
4                          6
5                          4


Solution:
[Figure: line graph with the number of times in store (1 through 5) on the x-axis and frequency on the y-axis]


Exercise: 


Problem: 


In a survey, several people were asked how many years it has been since they purchased a 
mattress. The results are shown in [link]. 


Years since last purchase Frequency 
0 2 

1 8 

2 13 

3 22 

4 16 

5 9


Exercise: 


Problem: 


Several children were asked how many TV shows they watch each day. The results of the 
survey are shown in [link]. 


Number of TV Shows Frequency 
0 12 
1 18 
2 36 
3 7 
4 2 
Solution: 
[Figure: line graph with TV shows watched per day (0 through 4) on the x-axis and frequency on the y-axis]
Exercise: 
Problem: 


The students in Ms. Ramirez’s math class have birthdays in each of the four seasons. [link] 
shows the four seasons, the number of students who have birthdays in each season, and the 
percentage (%) of students in each group. Construct a bar graph showing the number of 
students. 


Seasons Number of students Proportion of population 


Spring 8 24% 
Summer 9 26%
Autumn 11 32% 
Winter 6 18% 
Exercise: 
Problem: 


Using the data from Ms. Ramirez’s math class supplied in [link], construct a bar graph
showing the percentages.


Solution:
[Figure: bar graph of birthdays in each season, with the seasons (spring, summer, autumn, winter) on the x-axis and percentage on the y-axis]


Exercise: 


Problem: 


David County has six high schools. Each school sent students to participate in a county-wide 
science competition. [link] shows the percentage breakdown of competitors from each school, 
and the percentage of the entire student population of the county that goes to each school. 
Construct a bar graph that shows the population percentage of competitors from each school. 


High School   Science competition population   Overall student population
Alabaster     28.9%                            8.6%
Concordia     7.6%                             23.2%
Genoa         12.1%                            15.0%
Mocksville    18.5%                            14.3%
Tynneson      24.2%                            10.1%
West End      8.7%                             28.8%


Exercise:


Problem:



Use the data from the David County science competition supplied in [link]. Construct a bar 
graph that shows the county-wide population percentage of students at each school. 


Solution: 
[Figure: bar graph with the six high schools (Alabaster, Concordia, Genoa, Mocksville, Tynneson, West End) on the x-axis and proportion (%), from 0.0% to 35.0%, on the y-axis; panel title: Students in science competition from each school]


Homework 


Exercise: 


Problem: Student grades on a chemistry exam were: 77, 78, 76, 81, 86, 51, 79, 82, 84, 99 


a. Construct a stem-and-leaf plot of the data. 
b. Are there any potential outliers? If so, which scores are they? Why do you consider them 


outliers? 


Exercise: 


Problem: [link] contains the 2010 obesity rates in U.S. states and Washington, DC. 


State 


Alabama 


Alaska 
Arizona 
Arkansas 
California 


Colorado 


Connecticut 


Delaware 


Washington, 
DC 


Florida 
Georgia 


Hawaii 


Idaho 


Illinois 


Indiana 


Iowa 


Kansas 


Percent 
(%) 


32.2 


22.5


28.0 


28.4 


29.4 


State            Percent (%)
Kentucky         31.3
Louisiana        31.0
Maine            26.8
Maryland         27.1
Massachusetts    23.0
Michigan         30.9
Minnesota        24.8
Mississippi      34.0
Missouri         30.5
Montana          23.0
Nebraska         26.9
Nevada           22.4
New Hampshire    25.0
New Jersey       23.8
New Mexico       25.1
New York         23.9
North Carolina   27.8


State 


North 
Dakota 


Ohio 
Oklahoma 
Oregon 
Pennsylvania 
Rhode Island 


South 
Carolina 


South 
Dakota 


Tennessee 


Texas 
Utah 


Vermont 


Virginia 


Washington 


West 
Virginia 


Wisconsin 


Wyoming 


Percent 
(%) 


27.2 


29.2 
30.4 
26.8 
28.6 


25.5 


31.5 


27.3 


26.3 


25.1 


a. Use a random number generator to randomly pick eight states. Construct a bar graph of 
the obesity rates of those eight states. 
b. Construct a bar graph for all the states beginning with the letter "A." 


c. Construct a bar graph for all the states beginning with the letter "M." 


Solution: 


a. Example solution for using the random number generator for the TI-84+ to generate a 
simple random sample of 8 states. Instructions are as follows. 


• Number the entries in the table 1-51 (includes Washington, DC; numbered vertically).
• Press MATH.
• Arrow over to PRB.
• Press 5:randInt(.
• Enter 51,1,8).


Eight numbers are generated (use the right arrow key to scroll through the numbers). The
numbers correspond to the numbered states (for this example: {47 21 9 23 51 13 25 4}).
If any numbers are repeated, generate a different number by using 5:randInt(51,1). Here,
the states (and Washington DC) are {Arkansas, Washington DC, Idaho, Maryland,
Michigan, Mississippi, Virginia, Wyoming}.


Corresponding percents are {30.1, 22.2, 26.5, 27.1, 30.9, 34.0, 26.0, 25.1}. 
[Figures: bar graphs for parts a, b, and c, each with the selected states on the x-axis and percent (%) on the y-axis]


Histograms, Frequency Polygons, and Time Series Graphs 


For most of the work you do in this book, you will use a histogram to display the data. One advantage of a 
histogram is that it can readily display large data sets. A rule of thumb is to use a histogram when the data set 
consists of 100 values or more. 


A histogram consists of contiguous (adjoining) boxes. It has both a horizontal axis and a vertical axis. The 
horizontal axis is labeled with what the data represents (for instance, distance from your home to school). The 
vertical axis is labeled either frequency or relative frequency (or percent frequency or probability). The graph 
will have the same shape with either label. The histogram (like the stemplot) can give you the shape of the data, 
the center, and the spread of the data. 


The relative frequency is equal to the frequency for an observed value of the data divided by the total number of 
data values in the sample. (Remember, frequency is defined as the number of times an answer occurs.) If: 


• f = frequency
• n = total number of data values (or the sum of the individual frequencies), and
• RF = relative frequency,


then:
Equation:
$RF = \frac{f}{n}$
For example, if three students in Mr. Ahab's English class of 40 students received from 90% to 100%, then f = 3, n = 40, and $RF = \frac{f}{n} = \frac{3}{40} = 0.075$. That is, 7.5% of the students received 90-100%. The scores 90-100% are quantitative measures.
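
As a quick check of this arithmetic, the relative frequency can be computed exactly with Python's `fractions` module. The helper name `relative_frequency` is ours and is used only for illustration.

```python
from fractions import Fraction

def relative_frequency(f, n):
    """Relative frequency RF = f / n, kept as an exact fraction."""
    return Fraction(f, n)

# Mr. Ahab's class: 3 of 40 students scored from 90% to 100%.
rf = relative_frequency(3, 40)
print(rf, float(rf))
```

Keeping the value as a fraction until the last step avoids rounding surprises when many relative frequencies are later added together; they should sum to exactly 1.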


To construct a histogram, first decide how many bars or intervals, also called classes, represent the data. Many histograms consist of five to 15 bars or classes for clarity. Choose a starting point for the first interval that is less than the smallest data value. A convenient starting point is a lower value carried out to one more decimal place than the value with the most decimal places. For example, if the value with the most decimal places is 6.1 and this is the smallest value, a convenient starting point is 6.05 (6.1 - 0.05 = 6.05). We say that 6.05 has more precision. If the value with the most decimal places is 2.23 and the lowest value is 1.5, a convenient starting point is 1.495 (1.5 - 0.005 = 1.495). If the value with the most decimal places is 3.234 and the lowest value is 1.0, a convenient starting point is 0.9995 (1.0 - 0.0005 = 0.9995). If all the data happen to be integers and the smallest value is two, then a convenient starting point is 1.5 (2 - 0.5 = 1.5). Also, when the starting point and other boundaries are carried to one additional decimal place, no data value will fall on a boundary. The next two examples go into detail about how to construct a histogram using continuous data and how to create a histogram using discrete data.
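
The rule for a convenient starting point can be written as a small function. The sketch below is one possible reading of the rule, not a standard routine: the name `convenient_start` is hypothetical, and the values are passed as strings so that Python's `decimal` module can count decimal places exactly (including a trailing zero such as "1.0").

```python
from decimal import Decimal

def convenient_start(smallest, most_precise):
    """Subtract half a unit of the finest decimal place found in the data.

    smallest     -- the smallest data value, as a string
    most_precise -- the data value with the most decimal places, as a string
    """
    places = max(0, -Decimal(most_precise).as_tuple().exponent)
    half_unit = Decimal(5).scaleb(-(places + 1))  # 0.5, 0.05, 0.005, ...
    return Decimal(smallest) - half_unit

# The four cases worked in the paragraph above.
print(convenient_start("6.1", "6.1"))    # 6.1 - 0.05
print(convenient_start("1.5", "2.23"))   # 1.5 - 0.005
print(convenient_start("1.0", "3.234"))  # 1.0 - 0.0005
print(convenient_start("2", "2"))        # 2 - 0.5
```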


Example: 

The following data are the heights (in inches to the nearest half inch) of 100 male semiprofessional soccer players. 
The heights are continuous data, since height is measured. 

60; 60.5; 61; 61; 61.5

63.5; 63.5; 63.5

64; 64; 64; 64; 64; 64; 64; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5

66; 66; 66; 66; 66; 66; 66; 66; 66; 66; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 67; 67; 67;
67; 67; 67; 67; 67; 67; 67; 67; 67; 67.5; 67.5; 67.5; 67.5; 67.5; 67.5; 67.5

68; 68; 69; 69; 69; 69; 69; 69; 69; 69; 69; 69; 69.5; 69.5; 69.5; 69.5; 69.5

70; 70; 70; 70; 70; 70; 70.5; 70.5; 70.5; 71; 71; 71

72; 72; 72; 72.5; 72.5; 73; 73.5

74


The smallest data value is 60. Since the data with the most decimal places has one decimal (for instance, 61.5), we 
want our starting point to have two decimal places. Since the numbers 0.5, 0.05, 0.005, etc. are convenient 
numbers, use 0.05 and subtract it from 60, the smallest value, for the convenient starting point. 

60 - 0.05 = 59.95, which is more precise than, say, 61.5 by one decimal place. The starting point is, then, 59.95.
The largest value is 74, so 74 + 0.05 = 74.05 is the ending value.

Next, calculate the width of each bar or class interval. To calculate this width, subtract the starting point from the 
ending value and divide by the number of bars (you must choose the number of bars you desire). Suppose you 
choose eight bars. 

Equation:
$\frac{74.05 - 59.95}{8} \approx 1.76$


Note: 

NOTE 

We will round up to two and make each bar or class interval two units wide. Rounding up to two is one way to 
prevent a value from falling on a boundary. Rounding to the next number is often necessary even if it goes 
against the standard rules of rounding. For this example, using 1.76 as the width would also work. A guideline 
that is followed by some for the number of bars or class intervals is to take the square root of the number of data 
values and then round to the nearest whole number, if necessary. For example, if there are 150 values of data, 
take the square root of 150 and round to 12 bars or intervals. 
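
The width calculation and the square-root guideline from the note can be checked with a few lines of Python. This is a minimal sketch; the variable names are ours.

```python
import math

# Class width for the soccer-player heights: (ending value - starting point) / bars.
start, end, bars = 59.95, 74.05, 8
raw_width = (end - start) / bars   # about 1.76
width = math.ceil(raw_width)       # rounded up to the next whole number

# Square-root guideline for the number of bars with 150 data values.
suggested_bars = round(math.sqrt(150))

print(raw_width, width, suggested_bars)
```

Rounding the width up rather than to the nearest value guarantees that the chosen number of bars still covers the whole range from the starting point to the ending value.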


The boundaries are:


• 59.95
• 59.95 + 2 = 61.95
• 61.95 + 2 = 63.95
• 63.95 + 2 = 65.95
• 65.95 + 2 = 67.95
• 67.95 + 2 = 69.95
• 69.95 + 2 = 71.95
• 71.95 + 2 = 73.95
• 73.95 + 2 = 75.95


The heights 60 through 61.5 inches are in the interval 59.95-61.95. The heights that are 63.5 are in the interval
61.95-63.95. The heights that are 64 through 64.5 are in the interval 63.95-65.95. The heights 66 through 67.5 are
in the interval 65.95-67.95. The heights 68 through 69.5 are in the interval 67.95-69.95. The heights 70 through
71 are in the interval 69.95-71.95. The heights 72 through 73.5 are in the interval 71.95-73.95. The height 74 is
in the interval 73.95-75.95.
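
The assignment of heights to intervals can be reproduced programmatically. The sketch below assumes the tally of each distinct height as we read it from the data listing in this example (100 values in total); the dictionary `height_counts` and the other names are ours. It builds the class boundaries, counts the heights in each class, and converts the counts to relative frequencies.

```python
# Tally of each distinct height (value -> count), as read from the data listing.
height_counts = {
    60: 1, 60.5: 1, 61: 2, 61.5: 1,
    63.5: 3,
    64: 7, 64.5: 8,
    66: 10, 66.5: 11, 67: 12, 67.5: 7,
    68: 2, 69: 10, 69.5: 5,
    70: 6, 70.5: 3, 71: 3,
    72: 3, 72.5: 2, 73: 1, 73.5: 1,
    74: 1,
}

start, width, bars = 59.95, 2, 8
boundaries = [round(start + i * width, 2) for i in range(bars + 1)]

# Each height falls in class number floor((height - start) / width).
frequencies = [0] * bars
for height, count in height_counts.items():
    frequencies[int((height - start) // width)] += count

n = sum(frequencies)
relative = [f / n for f in frequencies]

for lo, hi, f, rf in zip(boundaries, boundaries[1:], frequencies, relative):
    print(f"{lo:.2f}-{hi:.2f}  frequency {f:>2}  relative frequency {rf:.2f}")
```

Because every boundary ends in .95 and every height ends in .0 or .5, no height can land exactly on a boundary, which is the point of carrying the boundaries to one extra decimal place.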


The following histogram displays the heights on the x-axis and relative frequency on the y-axis. 
Note:
Try It 
Exercise: 


Problem: 


The following data are the shoe sizes of 50 male students. The sizes are continuous data since shoe size is 
measured. Construct a histogram and calculate the width of each bar or class interval. Suppose you choose 
six bars. 
Solution:

Smallest value: 9 

Largest value: 14 

Convenient starting value: 9 — 0.05 = 8.95 


Convenient ending value: 14 + 0.05 = 14.05 


$\frac{14.05 - 8.95}{6} = 0.85$


The calculation suggests using 0.85 as the width of each bar or class interval. You can also use an interval
with a width equal to one.


Example: 

Create a histogram for the following data: the number of books bought by 50 part-time college students at ABC
College. The number of books is discrete data, since books are counted.

Eleven students buy one book. Ten students buy two books. Sixteen students buy three books. Six students buy
four books. Five students buy five books. Two students buy six books.

Because the data are integers, subtract 0.5 from 1, the smallest data value and add 0.5 to 6, the largest data value. 
Then the starting point is 0.5 and the ending value is 6.5. 

Exercise: 


Problem: 


Next, calculate the width of each bar or class interval. If the data are discrete and there are not too many 
different values, a width that places the data values in the middle of the bar or class interval is the most 
convenient. Since the data consist of the numbers 1, 2, 3, 4, 5, 6, and the starting point is 0.5, a width of one 
places the 1 in the middle of the interval from 0.5 to 1.5, the 2 in the middle of the interval from 1.5 to 2.5,
the 3 in the middle of the interval from 2.5 to 3.5, the 4 in the middle of the interval from _______ to
_______, the 5 in the middle of the interval from _______ to _______, and the _______ in the middle of the
interval from _______ to _______.


Solution: 


• 3.5 to 4.5
• 4.5 to 5.5
• 6
• 5.5 to 6.5


Calculate the number of bars as follows:
Equation:
$\frac{6.5 - 0.5}{\text{number of bars}} = 1$

where 1 is the width of a bar. Therefore, bars = 6.
The following histogram displays the number of books on the x-axis and the frequency on the y-axis. 
Note:


Go to [link]. There are calculator instructions for entering data and for creating a customized histogram. Create 
the histogram for [link]. 


• Press Y=. Press CLEAR to delete any equations.
• Press STAT 1:EDIT. If L1 has data in it, arrow up into the name L1, press CLEAR and then arrow down. If necessary, do the same for L2.
• Into L1, enter 1, 2, 3, 4, 5, 6.
• Into L2, enter 11, 10, 16, 6, 5, 2.
• Press WINDOW. Set Xmin = .5, Xmax = 6.5, Xscl = (6.5 - .5)/6, Ymin = -1, Ymax = 20, Yscl = 1, Xres = 1.
• Press 2nd Y=. Start by pressing 4:PlotsOff ENTER.
• Press 2nd Y=. Press 1:Plot1. Press ENTER. Arrow down to TYPE. Arrow to the 3rd picture (histogram). Press ENTER.
• Arrow down to Xlist: Enter L1 (2nd 1). Arrow down to Freq. Enter L2 (2nd 2).
• Press GRAPH.
• Use the TRACE key and the arrow keys to examine the histogram.


Note: 
Try It 
Exercise: 


Problem: 


The following data are the number of sports played by 50 student athletes. The number of sports is discrete 
data since sports are counted. 


SP Oe OR oh Bhos ope 
20 student athletes play one sport. 22 student athletes play two sports. Eight student athletes play three 
sports. 


Fill in the blanks for the following sentence. Since the data consist of the numbers 1, 2, 3, and the starting
point is 0.5, a width of one places the 1 in the middle of the interval 0.5 to _______, the 2 in the middle of the
interval from _______ to _______, and the 3 in the middle of the interval from _______ to _______.


Solution: 
• 1.5
• 1.5 to 2.5
• 2.5 to 3.5


Example: 
Exercise: 


Problem: Using this data set, construct a histogram. 


Number of Hours My Classmates Spent Playing Video Games on Weekends

9.95   10     2.25   16.75   0
19.5   22.5   7.5    15      12.75
5.5    11     10     20.75   17.5
23     21.9   24     23.75   18
20     15     22.9   18.8    20.5
Solution: 


[Figure: histogram titled "Hours Spent Playing Video Games on Weekends," with number of hours (0 to 25) on the x-axis and number of students on the y-axis]


Some values in this data set fall on boundaries for the class intervals. A value is counted in a class interval if 
it falls on the left boundary, but not if it falls on the right boundary. Different researchers may set up 
histograms for the same data in different ways. There is more than one correct way to set up a histogram. 
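
The boundary rule stated above (count a value on the left boundary, not on the right) corresponds to half-open classes of the form [left, right). A minimal sketch, assuming classes of width 5 that start at 0, as the axis labels of the histogram suggest; the function name `class_index` is ours:

```python
def class_index(value, start, width):
    """Index of the class [start + k*width, start + (k+1)*width) that holds value."""
    return int((value - start) // width)

# A few values from the table above; 10, 15, and 20 sit exactly on boundaries.
hours = [0, 5.5, 9.95, 10, 15, 20, 24]
labels = [(h, class_index(h, start=0, width=5)) for h in hours]

for h, k in labels:
    print(f"{h:>5} hours -> class {k}: [{5 * k}, {5 * k + 5})")
```

With this convention, 10 is counted in the class from 10 to 15 rather than in the class from 5 to 10. A researcher who chooses the opposite convention will get a slightly different, but equally valid, histogram.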


Note: 
Try It 
Exercise: 


Problem: 


The following data represent the number of employees at various restaurants in New York City. Using this 
data, create a histogram. 


22; 35; 15; 26; 40; 28; 18; 20; 25; 34; 39; 42; 24; 22; 19; 27; 22; 34; 40; 20; 38; and 28
Use 10-19 as the first interval. 


Note: 

Count the money (bills and change) in your pocket or purse. Your instructor will record the amounts. As a class, 
construct a histogram displaying the data. Discuss how many intervals you think is appropriate. You may want to 
experiment with the number of intervals. 


Frequency Polygons 


Frequency polygons are analogous to line graphs, and just as line graphs make continuous data visually easy to 
interpret, so too do frequency polygons. 


To construct a frequency polygon, first examine the data and decide on the number of intervals, or class intervals, 
to use on the x-axis and y-axis. After choosing the appropriate ranges, begin plotting the data points. After all the 
points are plotted, draw line segments to connect them. 


Example: 
A frequency polygon was constructed from the frequency table below. 


Frequency Distribution for Calculus Final Test Scores


Lower Bound Upper Bound Frequency Cumulative Frequency
49.5 59.5 5 5
59.5 69.5 10 15
69.5 79.5 30 45
79.5 89.5 40 85
89.5 99.5 15 100


[Figure: frequency polygon "Test Scores"; x-axis: Scores (44.5, 54.5, 64.5, 74.5, 84.5, 94.5, 104.5); y-axis: Frequency]


The first label on the x-axis is 44.5. This represents an interval extending from 39.5 to 49.5. Since the table contains no
scores below 49.5, this interval is used only to allow the graph to touch the x-axis. The point labeled 54.5 represents the
next interval, or the first “real” interval from the table, and contains five scores. This reasoning is followed for 
each of the remaining intervals with the point 104.5 representing the interval from 99.5 to 109.5. Again, this 
interval contains no data and is only used so that the graph will touch the x-axis. Looking at the graph, we say that 
this distribution is skewed because one side of the graph does not mirror the other side. 
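
The points of this frequency polygon can be computed directly from the frequency table. The sketch below is one possible way to do it, not a prescribed method: it takes each class midpoint as the x-value and adds one empty interval on each side so that the graph touches the x-axis. The function name `polygon_points` is made up for illustration.

```python
def polygon_points(lower_bounds, upper_bounds, frequencies):
    """Return (midpoint, frequency) pairs, padded with a zero at each end."""
    width = upper_bounds[0] - lower_bounds[0]
    mids = [(lo + hi) / 2 for lo, hi in zip(lower_bounds, upper_bounds)]
    points = [(mids[0] - width, 0)]          # empty interval on the left
    points += list(zip(mids, frequencies))   # the "real" intervals
    points.append((mids[-1] + width, 0))     # empty interval on the right
    return points


# Calculus final test scores, taken from the frequency table above.
lower = [49.5, 59.5, 69.5, 79.5, 89.5]
upper = [59.5, 69.5, 79.5, 89.5, 99.5]
freq = [5, 10, 30, 40, 15]

points = polygon_points(lower, upper, freq)
for x, y in points:
    print(x, y)
```

Connecting these seven points in order with line segments gives the polygon described in the text, starting at 44.5 and ending at 104.5.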


Note: 
Try It 
Exercise: 


Problem: Construct a frequency polygon of U.S. Presidents’ ages at inauguration shown in [link]. 


Age at Inauguration 
41.5-46.5 
46.5-51.5 
51.5-56.5 
56.5-61.5 
61.5-66.5 


66.5-71.5


Frequency
4
11


14 


Solution: 


The first label on the x-axis is 39. This represents an interval extending from 36.5 to 41.5. Since there are no 
ages less than 41.5, this interval is used only to allow the graph to touch the x-axis. The point labeled 44 
represents the next interval, or the first “real” interval from the table, and contains four ages. This
reasoning is followed for each of the remaining intervals with the point 74 representing the interval from 
71.5 to 76.5. Again, this interval contains no data and is only used so that the graph will touch the x-axis. 
Looking at the graph, we say that this distribution is skewed because one side of the graph does not mirror 
the other side. 

[Figure: frequency polygon "President’s Age at Inauguration"; y-axis: Frequency]


Frequency polygons are useful for comparing distributions. This is achieved by overlaying the frequency polygons 
drawn for different data sets. 


Example: 


We will construct an overlay frequency polygon comparing the scores from [link] with the students’ final numeric 
grade. 


Frequency Distribution for Calculus Final Test Scores 


Lower Bound Upper Bound Frequency Cumulative Frequency 
49.5 59.5 5 5 

59.5 69.5 10 15 

69.5 79.5 30 45 

79.5 89.5 40 85 


89.5 99.5 15 100 


Frequency Distribution for Calculus Final Grades 


Lower Bound Upper Bound Frequency Cumulative Frequency 
49.5 59.5 10 10 

59.5 69.5 10 20 

69.5 79.5 30 50 

79.5 89.5 45 95 

89.5 99.5 5 100 


[Figure: overlay frequency polygon "Final Test Grade v Final Grade"; x-axis: Grades (44.5, 54.5, 64.5, 74.5, 84.5, 94.5, 104.5); y-axis: Frequency]


Suppose that we want to study the temperature range of a region for an entire month. Every day at noon we note 
the temperature and write this down in a log. A variety of statistical studies could be done with this data. We could 
find the mean or the median temperature for the month. We could construct a histogram displaying the number of 
days that temperatures reach a certain range of values. However, all of these methods ignore a portion of the data 
that we have collected. 


One feature of the data that we may want to consider is that of time. Since each date is paired with the temperature 
reading for the day, we don't have to think of the data as being random. We can instead use the times given to
impose a chronological order on the data. A graph that recognizes this ordering and displays the changing 
temperature as the month progresses is called a time series graph. 


Constructing a Time Series Graph 


To construct a time series graph, we must look at both pieces of our paired data set. We start with a standard 
Cartesian coordinate system. The horizontal axis is used to plot the date or time increments, and the vertical axis is 
used to plot the values of the variable that we are measuring. By doing this, we make each point on the graph 
correspond to a date and a measured quantity. The points on the graph are typically connected by straight lines in 
the order in which they occur. 
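
As a rough sketch of these steps, the Python below puts paired (year, value) readings into chronological order, which is the order in which the points would be connected. The values are the "Annual" column of the Consumer Price Index example in this section; the function name `time_series_points` is hypothetical, and the drawing itself is left to whatever graphing tool is at hand.

```python
# Annual CPI by year, deliberately stored out of order.
annual_cpi = {
    2007: 207.342, 2003: 184.0, 2010: 218.056, 2005: 195.3, 2012: 229.594,
    2004: 188.9, 2009: 214.537, 2006: 201.6, 2011: 224.939, 2008: 215.303,
}


def time_series_points(readings):
    """Return (time, value) pairs in chronological order."""
    return sorted(readings.items())


points = time_series_points(annual_cpi)

# Each consecutive pair of points is one line segment of the graph.
for (t0, v0), (t1, v1) in zip(points, points[1:]):
    print(f"{t0} -> {t1}: change of {v1 - v0:+.3f}")
```

Printing the change along each segment already hints at what the graph makes visible: the index rises in every year except from 2008 to 2009.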


Example: 
Exercise: 


Problem: 


The following data shows the Annual Consumer Price Index, each month, for ten years. Construct a time 
series graph for the Annual Consumer Price Index data only. 


Year Jan Feb Mar Apr May Jun Jul
2003 181.7 183.1 184.2 183.8 183.5 183.7 183.9
2004 185.2 186.2 187.4 188.0 189.1 189.7 189.4
2005 190.7 191.8 193.3 194.6 194.4 194.5 195.4
2006 198.3 198.7 199.8 201.5 202.5 202.9 203.5
2007 202.416 203.499 205.352 206.686 207.949 208.352 208.299
2008 211.080 211.693 213.528 214.823 216.632 218.815 219.964
2009 211.143 212.193 212.709 213.240 213.856 215.693 215.351
2010 216.687 216.741 217.631 218.009 218.178 217.965 218.011
2011 220.223 221.309 223.467 224.906 225.964 225.722 225.922
2012 226.665 227.663 229.392 230.085 229.815 229.478 229.104


Year Aug Sep Oct Nov Dec Annual
2003 184.6 185.2 185.0 184.5 184.3 184.0
2004 189.5 189.9 190.9 191.0 190.3 188.9
2005 196.4 198.8 199.2 197.6 196.8 195.3
2006 203.9 202.9 201.8 201.5 201.8 201.6
2007 207.917 208.490 208.936 210.177 210.036 207.342
2008 219.086 218.783 216.573 212.425 210.228 215.303
2009 215.834 215.969 216.177 216.330 215.949 214.537
2010 218.312 218.439 218.711 218.803 219.179 218.056
2011 226.545 226.889 226.421 226.230 225.672 224.939
2012 230.379 231.407 231.317 230.221 229.601 229.594
Solution:

[Figure: time series graph "Annual CPI"; x-axis: Year (2003 to 2012); y-axis: Annual consumer price index]


Note: 
Try It 
Exercise: 


Problem: 


The following table is a portion of a data set from www.worldbank.org. Use the table to construct a time 
series graph for CO2 emissions for the United States.


CO2 Emissions 


Ukraine United Kingdom United States 
2003 352,259 540,640 5,681,664 
2004 343,121 540,409 5,790,761 
2005 339,029 541,990 5,826,394 
2006 327,797 542,045 5,737,615 
2007 328,357 528,631 5,828,697 
2008 323,657 522,247 5,656,839 
2009 272,176 474,579 5,299,563 


Solution: 


[Figure: time series graph "US CO2 Emissions"; x-axis: Year (2003 to 2009); y-axis: CO2 emissions in kt (millions)]


Uses of a Time Series Graph 


Time series graphs are important tools in various applications of statistics. When recording values of the same 
variable over an extended period of time, sometimes it is difficult to discern any trend or pattern. However, once 
the same data points are displayed graphically, some features jump out. Time series graphs make trends easy to 
spot. 
Chapter Review


A histogram is a graphic version of a frequency distribution. The graph consists of bars of equal width drawn 
adjacent to each other. The horizontal scale represents classes of quantitative data values and the vertical scale 
represents frequencies. The heights of the bars correspond to frequency values. Histograms are typically used for 
large, continuous, quantitative data sets. A frequency polygon can also be used when graphing large data sets with
data points that repeat. The data values (class midpoints) go on the x-axis, and the frequencies are graphed on the
y-axis. Time series graphs can be helpful when looking at large amounts of data for one variable over a period of time.

Exercise: 


Problem: 
Sixty-five randomly selected car salespersons were asked the number of cars they generally sell in one week. 


Fourteen people answered that they generally sell three cars; nineteen generally sell four cars; twelve 
generally sell five cars; nine generally sell six cars; eleven generally sell seven cars. Complete the table. 


Data Value (# cars) Frequency Relative Frequency Cumulative Relative Frequency 


Exercise: 


Problem: What does the frequency column in [link] sum to? Why? 


Solution: 


65 


Exercise: 


Problem: What does the relative frequency column in [link] sum to? Why? 


Exercise: 
Problem: What is the difference between relative frequency and frequency for each data value in [link]? 


Solution: 


The relative frequency shows the proportion of data points that have each value. The frequency tells the 
number of data points that have each value. 
Exercise: 


Problem: 


What is the difference between cumulative relative frequency and relative frequency for each data value? 


Exercise: 


Problem: 
To construct the histogram for the data in [link], determine appropriate minimum and maximum x and y 


values and the scaling. Sketch the histogram. Label the horizontal and vertical axes with words. Include 
numerical scaling. 


Solution: 


Answers will vary. One possible histogram is shown: 
[Figure: histogram; x-axis: Number of cars sold (3 to 8); y-axis: Frequency (scale up to 20)]


Exercise: 


Problem: Construct a frequency polygon for the following: 


a. Pulse Rates for Women Frequency 
60-69 12 
70-79 14 
80-89 11 
90-99 1
100-109 1 
110-119 0 


120-129 1 


b. Actual Speed in a 30 MPH Zone Frequency 


42-45 25 
46-49 14 
50-53 7 
54-57 3 
58-61 1 

c. Tar (mg) in Nonfiltered Cigarettes Frequency 
10-13 1 
14-17 0 
18-21 15 
22-25 7 
26-29 2 
Exercise: 
Problem: 


Construct a frequency polygon from the frequency distribution for the 50 highest ranked countries for depth 
of hunger. 


Depth of Hunger Frequency
230-259 21
260-289 13
290-319 5
320-349 7
350-379 1
380-409 1
410-439 1


Solution: 


Find the midpoint for each class. These will be graphed on the x-axis. The frequency values will be graphed 
on the y-axis.
[Figure: frequency polygon "Depth of Hunger"; x-axis: Depth of hunger (classes 230-259 through 410-439); y-axis: Frequency]


Exercise: 
Problem: 
Use the two frequency tables to compare the life expectancy of men and women from 20 randomly selected 


countries. Include an overlayed frequency polygon and discuss the shapes of the distributions, the center, the 
spread, and any outliers. What can we conclude about the life expectancy of women compared to men? 


Life Expectancy at Birth - Women Frequency 
49-55 3 

56-62 3 

63-69 1 

70-76 3 

77-83 8 

84-90 2 

Life Expectancy at Birth - Men Frequency
49-55 3
56-62 3
63-69 1
70-76 1
77-83
84-90


Exercise:


Problem:


Construct a time series graph for (a) the number of male births, (b) the number of female births, and (c) the
total number of births.


Sex/Year 1855 1856 1857 1858 1859 1860 1861
Female 45,545 49,582 50,257 50,324 51,915 51,220 52,403
Male 47,804 52,239 53,158 53,694 54,628 54,409 54,606
Total 93,349 101,821 103,415 104,018 106,543 105,629 107,009


Sex/Year 1862 1863 1864 1865 1866 1867 1868
Female 51,812 53,115 54,959 54,850 55,307 55,927 56,292
Male 55,257 56,226 57,374 58,220 58,360 58,517 59,222
Total 107,069 109,341 112,333 113,070 113,667 114,044 115,514


Sex/Year 1870 1871 1872 1873 1874 1875
Female 56,431 56,099 57,472 58,233 60,109 60,146
Male 58,959 60,029 61,293 61,467 63,602 63,432
Total 115,390 116,128 118,765 119,700 123,711 123,578




Solution: 


[Figure: time series graph "Births in Scotland"; x-axis: Year; y-axis: Number of births (40,000 to 130,000); series: Both sexes, Males, Females]
Exercise: 
Problem: 


The following data sets list full time police per 100,000 citizens along with homicides per 100,000 citizens for 
the city of Detroit, Michigan during the period from 1961 to 1973. 


Year 1961 1962 1963 1964 1965 1966 1967 
Police 260.35 269.8 272.04 272.96 272.51 261.34 268.89 
Homicides 8.6 8.9 8.52 8.89 13.07 14.57 21.36 
Year 1968 1969 1970 1971 1972 1973 
Police 295.99 319.87 341.43 356.59 376.69 390.19 
Homicides 28.03 31.49 37.39 46.26 47.24 52.33 


a. Construct a double time series graph using a common x-axis for both sets of data. 
b. Which variable increased the fastest? Explain. 
c. Did Detroit’s increase in police officers have an impact on the murder rate? Explain. 


Homework 


Exercise: 


Problem: 


Suppose that three book publishers were interested in the number of fiction paperbacks adult consumers 
purchase per month. Each publisher conducted a survey. In the survey, adult consumers were asked the 
number of fiction paperbacks they had purchased the previous month. The results are as follows: 


# of books Freq. Rel. Freq. 
0 10 

1 12 

2 16 

3 12 

4 8 

5 6 

6 2 

8 2 

Publisher A 

# of books Freq. Rel. Freq. 
0 18 

1 24 

2 24 

3 22 

4 15 

5 10 

7 5 

9 1 


Publisher B 


# of books Freq. Rel. Freq. 


0-1 20 

2-3 35 

4-5 12 

6-7 2 

8-9 1 
Publisher C 


a. Find the relative frequencies for each survey. Write them in the charts. 

b. Using either a graphing calculator, computer, or by hand, use the frequency column to construct a 
histogram for each publisher's survey. For Publishers A and B, make bar widths of one. For Publisher C, 
make bar widths of two. 

c. In complete sentences, give two reasons why the graphs for Publishers A and B are not identical. 

d. Would you have expected the graph for Publisher C to look like the other two graphs? Why or why not? 

e. Make new histograms for Publisher A and Publisher B. This time, make bar widths of two. 

f. Now, compare the graph for Publisher C to the new graphs for Publishers A and B. Are the graphs more 
similar or more different? Explain your answer. 


Exercise: 


Problem: 


Often, cruise ships conduct all on-board transactions, with the exception of gambling, on a cashless basis. At 
the end of the cruise, guests pay one bill that covers all onboard transactions. Suppose that 60 single travelers 
and 70 couples were surveyed as to their on-board bills for a seven-day cruise from Los Angeles to the 
Mexican Riviera. Following is a summary of the bills for each group. 


Amount($) Frequency Rel. Frequency 
51-100 5 

101-150 10 

151-200 15 

201-250 15 

251-300 10 

301-350 5 


Singles 


Amount($) Frequency Rel. Frequency 


100-150 5 

201-250 5 

251-300 5 

301-350 5 

351-400 10 

401-450 10 

451-500 10 

501-550 10 

551-600 5 

601-650 5 

Couples 

a. Fill in the relative frequency for each group. 

b. Construct a histogram for the singles group. Scale the x-axis by $50 widths. Use relative frequency on 
the y-axis. 

c. Construct a histogram for the couples group. Scale the x-axis by $50 widths. Use relative frequency on 
the y-axis. 


d. Compare the two graphs: 


i. List two similarities between the graphs. 
ii. List two differences between the graphs. 
iii. Overall, are the graphs more similar or different? 


e. Construct a new graph for the couples by hand. Since each couple is paying for two individuals, instead
of scaling the x-axis by $50, scale it by $100. Use relative frequency on the y-axis.
f. Compare the graph for the singles with the new graph for the couples:


i. List two similarities between the graphs.
ii. Overall, are the graphs more similar or different?


g. How did scaling the couples graph differently change the way you compared it to the singles graph? 
h. Based on the graphs, do you think that individuals spend the same amount, more or less, as singles as 
they do person by person as a couple? Explain why in one or two complete sentences. 


Solution: 


Amount($) Frequency Relative Frequency 


51-100 5 0.08 
101-150 10 0.17 
151-200 15 0.25 
201-250 15 0.25 
251-300 10 0.17 
301-350 5 0.08 
Singles 
Amount($) Frequency Relative Frequency 
100-150 5 0.07 
201-250 5 0.07 
251-300 5 0.07 
301-350 5 0.07 
351-400 10 0.14 
401-450 10 0.14 
451-500 10 0.14 
501-550 10 0.14 
551-600 5 0.07 
601-650 5 0.07 
Couples 


a. See [link] and [link]. 

b. In the following histogram data values that fall on the right boundary are counted in the class interval, 
while values that fall on the left boundary are not counted (with the exception of the first interval where 
both boundary values are included). 


[Figure: relative frequency histogram "Onboard Charges for Singles, 7-Day Cruise Sailing to the Mexican Riviera from LA"; x-axis: Amount ($), 50 to 350 in steps of 50; y-axis: Relative frequency (0 to 0.25)]

c. In the following histogram, the data values that fall on the right boundary are counted in the class 
interval, while values that fall on the left boundary are not counted (with the exception of the first 


interval where values on both boundaries are included). 


[Figure: relative frequency histogram for part c, titled "Onboard Charges for Singles, 7-Day Cruise Sailing to the Mexican Riviera from LA" in the source; x-axis: Amount ($), 100 to 650 in steps of 50; y-axis: Relative Frequency]


d. Compare the two graphs: 
i. Answers may vary. Possible answers include: 


• Both graphs have a single peak.
• Both graphs use class intervals with width equal to $50.


ii. Answers may vary. Possible answers include: 


• The couples graph has a class interval with no values.
• It takes almost twice as many class intervals to display the data for couples.


iii. Answers may vary. Possible answers include: The graphs are more similar than different because 
the overall patterns for the graphs are the same. 


e. Check student's solution. 
f. Compare the graph for the Singles with the new graph for the Couples: 


i. • Both graphs have a single peak.
• Both graphs display 6 class intervals.
• Both graphs show the same general pattern.


ii. Answers may vary. Possible answers include: Although the width of the class intervals for couples 
is double that of the class intervals for singles, the graphs are more similar than they are different. 


g. Answers may vary. Possible answers include: You are able to compare the graphs interval by interval. It 
is easier to compare the overall patterns with the new scale on the Couples graph. Because a couple 
represents two individuals, the new scale leads to a more accurate comparison. 

h. Answers may vary. Possible answers include: Based on the histograms, it seems that spending does not 
vary much from singles to individuals who are part of a couple. The overall patterns are the same. The 
range of spending for couples is approximately double the range for individuals. 
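
The relative frequencies in part (a) of this solution can be checked with a short computation: each frequency is divided by the group total and rounded to two decimal places. The sketch below is one way to do that; the function name `relative_frequencies` is hypothetical, and the frequencies are copied from the Singles and Couples tables.

```python
def relative_frequencies(freqs):
    """Divide each frequency by the total and round to two decimals."""
    total = sum(freqs)
    return [round(f / total, 2) for f in freqs]


singles = [5, 10, 15, 15, 10, 5]               # 60 single travelers
couples = [5, 5, 5, 5, 10, 10, 10, 10, 5, 5]   # 70 couples

print(relative_frequencies(singles))
print(relative_frequencies(couples))
```

Because each value is rounded, the printed relative frequencies need not add up to exactly 1.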


Exercise: 


Problem: 


Twenty-five randomly selected students were asked the number of movies they watched the previous week. 
The results are as follows. 


# of movies Frequency Relative Frequency Cumulative Relative Frequency 
0 5 
1 9 
2 6 
3 4 
4 1 


a. Construct a histogram of the data. 
b. Complete the columns of the chart. 


Use the following information to answer the next two exercises: Suppose one hundred eleven people who shopped 
in a special t-shirt store were asked the number of t-shirts they own costing more than $19 each. 


[Figure: relative frequency histogram; x-axis: Number of T-shirts costing more than $19 each (1 to 7); y-axis: Relative frequency (one labeled tick reads 40/111)]


Exercise: 


Problem: 
The percentage of people who own at most three t-shirts costing more than $19 each is approximately: 


a. 21 
b. 59 
c. 41 
d. Cannot be determined 


Solution: 


c


Exercise: 


Problem: 

If the data were collected by asking the first 111 people who entered the store, then the type of sampling is: 
a. cluster 
b. simple random 


c. stratified 
d. convenience 


Exercise: 


Problem: Following are the 2010 obesity rates by U.S. states and Washington, DC. 


State Percent (%)   State Percent (%)   State Percent (%)
Alabama 32.2   Kentucky 31.3   North Dakota 27.2
Alaska 24.5   Louisiana 31.0   Ohio 29.2
Arizona 24.3   Maine 26.8   Oklahoma 30.4
Arkansas 30.1   Maryland 27.1   Oregon 26.8
California 24.0   Massachusetts 23.0   Pennsylvania 28.6
Colorado 21.0   Michigan 30.9   Rhode Island 25.5
Connecticut 22.5   Minnesota 24.8   South Carolina 31.5
Delaware 28.0   Mississippi 34.0   South Dakota 27.3
Washington, DC 22.2   Missouri 30.5   Tennessee 30.8
Florida 26.6   Montana 23.0   Texas 31.0
Georgia 29.6   Nebraska 26.9   Utah 22.5
Hawaii 22.7   Nevada 22.4   Vermont 23.2
Idaho 26.5   New Hampshire 25.0   Virginia 26.0
Illinois 28.2   New Jersey 23.8   Washington 25.5
Indiana 29.6   New Mexico 25.1   West Virginia 32.5
Iowa 28.4   New York 23.9   Wisconsin 26.3
Kansas 29.4   North Carolina 27.8   Wyoming 25.1


Construct a bar graph of obesity rates of your state and the four states closest to your state. Hint: Label the x- 
axis with the states. 


Solution: 


Answers will vary. 


Glossary 


Frequency 
the number of times a value of the data occurs 


Histogram 
a graphical representation in x-y form of the distribution of data in a data set; x represents the data and y 
represents the frequency, or relative frequency. The graph consists of contiguous rectangles. 


Relative Frequency 
the ratio of the number of times a value of the data occurs in the set of all outcomes to the number of all 
outcomes 


Measures of the Location of the Data 
The common measures of location are quartiles and percentiles 


Quartiles are special percentiles. The first quartile, Q1, is the same as the 25th
percentile, and the third quartile, Q3, is the same as the 75th percentile. The
median, M, is called both the second quartile and the 50th percentile.


To calculate quartiles and percentiles, the data must be ordered from smallest to 
largest. Quartiles divide ordered data into quarters. Percentiles divide ordered 
data into hundredths. To score in the 90th percentile of an exam does not mean,
necessarily, that you received 90% on a test. It means that 90% of test scores are 
the same or less than your score and 10% of the test scores are the same or 
greater than your test score. 
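
To make the distinction concrete, the sketch below counts what percent of a set of scores are the same as or less than a given score. The ten scores are invented purely for illustration, and the function name `percent_at_or_below` is hypothetical.

```python
def percent_at_or_below(scores, your_score):
    """Percent of scores that are the same as or less than your_score."""
    at_or_below = sum(1 for s in scores if s <= your_score)
    return 100 * at_or_below / len(scores)


# Hypothetical exam scores (percent correct) for a class of ten.
scores = [55, 60, 62, 68, 70, 71, 75, 80, 88, 95]

print(percent_at_or_below(scores, 88))
```

In this made-up class, a score of 88% sits at the 90th percentile: nine of the ten scores are the same as or less than 88, even though the score itself is not 90%.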


Percentiles are useful for comparing values. For this reason, universities and 
colleges use percentiles extensively. One instance in which colleges and 
universities use percentiles is when SAT results are used to determine a minimum 
testing score that will be used as an acceptance factor. For example, suppose 
Duke accepts SAT scores at or above the 75th percentile. That translates into a
score of at least 1220. 


Percentiles are mostly used with very large populations. Therefore, if you were to 
say that 90% of the test scores are less (and not the same or less) than your score, 
it would be acceptable because removing one particular data value is not 
significant. 


The median is a number that measures the "center" of the data. You can think of 
the median as the "middle value," but it does not actually have to be one of the 
observed values. It is a number that separates ordered data into halves. Half the 
values are the same number or smaller than the median, and half the values are 
the same number or larger. For example, consider the following data. 

1; 11.5; 6; 7.2; 4; 8; 9; 10; 6.8; 8.3; 2; 2; 10; 1

Ordered from smallest to largest:

1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5


Since there are 14 observations, the median is between the seventh value, 6.8, 
and the eighth value, 7.2. To find the median, add the two values together and 
divide by two. 

Equation: 


(6.8 + 7.2) / 2 = 7


The median is seven. Half of the values are smaller than seven and half of the 
values are larger than seven. 


Quartiles are numbers that separate the data into quarters. Quartiles may or may 
not be part of the data. To find the quartiles, first find the median or second 
quartile. The first quartile, Q1, is the middle value of the lower half of the data,
and the third quartile, Q3, is the middle value, or median, of the upper half of the
data. To get the idea, consider the same data set:

1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5


The median or second quartile is seven. The lower half of the data are 1, 1, 2, 2, 
4, 6, 6.8. The middle value of the lower half is two. 
1; 1; 2; 2; 4; 6; 6.8


The number two, which is part of the data, is the first quartile. One-fourth of the
entire set of values are the same as or less than two and three-fourths of the
values are more than two.


The upper half of the data is 7.2, 8, 8.3, 9, 10, 10, 11.5. The middle value of the 
upper half is nine. 


The third quartile, Q3, is nine. Three-fourths (75%) of the ordered data set are 
less than nine. One-fourth (25%) of the ordered data set are greater than nine. The 
third quartile is part of the data set in this example. 
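
The steps just described (order the data, find the median, then take the median of each half) can be sketched in Python as follows. This mirrors the hand method used in this section and is not the only convention in use; statistical software often interpolates instead. The function names are hypothetical.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # With an odd number of values, the median itself is left out of both halves.
    upper = ordered[half + (n % 2):]
    return median(lower), median(ordered), median(upper)


data = [1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5]
q1, m, q3 = quartiles(data)
print(q1, m, q3)
```

For the 14 values above this gives a first quartile of two, a median of seven, and a third quartile of nine, in agreement with the text.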


The interquartile range is a number that indicates the spread of the middle half 
or the middle 50% of the data. It is the difference between the third quartile (Q3)
and the first quartile (Q1).


IQR = Q3 - Q1


The IQR can help to determine potential outliers. A value is suspected to be a
potential outlier if it is more than (1.5)(IQR) below the first quartile or more
than (1.5)(IQR) above the third quartile. Potential outliers always require
further investigation.


Note: 

NOTE 

A potential outlier is a data point that is significantly different from the other 
data points. These special data points may be errors or some kind of abnormality 
or they may be a key to understanding the data. 


Example: 
Exercise: 


Problem: 


For the following 13 real estate prices, calculate the IQR and determine if
any prices are potential outliers. Prices are in dollars. 

389,950; 230,500; 158,000; 479,000; 639,000; 114,950; 5,500,000; 
387,000; 659,000; 529,000; 575,000; 488,800; 1,095,000 


Solution: 

Order the data from smallest to largest. 

114,950; 158,000; 230,500; 387,000; 389,950; 479,000; 488,800; 529,000; 
575,000; 639,000; 659,000; 1,095,000; 5,500,000 

M = 488,800

Q1 = (230,500 + 387,000) / 2 = 308,750

Q3 = (639,000 + 659,000) / 2 = 649,000

IQR = 649,000 - 308,750 = 340,250
(1.5)(IQR) = (1.5)(340,250) = 510,375
Q1 - (1.5)(IQR) = 308,750 - 510,375 = -201,625
Q3 + (1.5)(IQR) = 649,000 + 510,375 = 1,159,375


No house price is less than -201,625. However, 5,500,000 is more than
1,159,375. Therefore, 5,500,000 is a potential outlier. 
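
The same check can be written as a short program. The sketch below assumes the median-of-halves method for the quartiles, as in the solution above; the function names `median` and `iqr_fences` are hypothetical, and the prices are the 13 values from the problem.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def iqr_fences(data):
    """Return (lower fence, upper fence) = (Q1 - 1.5 IQR, Q3 + 1.5 IQR)."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half + (n % 2):])  # skip the median when n is odd
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


prices = [389950, 230500, 158000, 479000, 639000, 114950, 5500000,
          387000, 659000, 529000, 575000, 488800, 1095000]

low, high = iqr_fences(prices)
outliers = [p for p in prices if p < low or p > high]
print(low, high, outliers)
```

Only 5,500,000 falls outside the two fences. The price 1,095,000 is large but still below the upper fence of 1,159,375, so it is not flagged.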


Note: 
Try It 
Exercise: 


Problem: 


For the following 11 salaries, calculate the IQR and determine if any
salaries are outliers. The salaries are in dollars. 


$33,000 $64,500 $28,000 $54,000 $72,000 $68,500 $69,000 $42,000 
$54,000 $120,000 $40,500 


Solution: 


Order the data from smallest to largest. 


$28,000 $33,000 $40,500 $42,000 $54,000 $54,000 $64,500 $68,500 
$69,000 $72,000 $120,000 


Median = $54,000 

Q1 = $40,500

Q3 = $69,000

IQR = $69,000 - $40,500 = $28,500

(1.5)(IQR) = (1.5)($28,500) = $42,750

Q1 - (1.5)(IQR) = $40,500 - $42,750 = -$2,250
Q3 + (1.5)(IQR) = $69,000 + $42,750 = $111,750


No salary is less than -$2,250. However, $120,000 is more than $111,750, so
$120,000 is a potential outlier.


Example: 
Exercise: 


Problem: 

For the two data sets in the test scores example, find the following: 
a. The interquartile range. Compare the two interquartile ranges. 
b. Any outliers in either set. 

Solution: 


The five number summary for the day and night classes is 


Minimum Q1 Median Q3 Maximum
Day 32 56 74.5 82.5 99
Night 25.5 78 81 89 98


a. The IQR for the day group is Q3 - Q1 = 82.5 - 56 = 26.5
The IQR for the night group is Q3 - Q1 = 89 - 78 = 11


The interquartile range (the spread or variability) for the day class is
larger than the night class IQR. This suggests more variation will be
found in the day class's test scores.

b. Day class outliers are found using the IQR times 1.5 rule. So,


• Q1 - IQR(1.5) = 56 - 26.5(1.5) = 16.25
• Q3 + IQR(1.5) = 82.5 + 26.5(1.5) = 122.25


Since the minimum and maximum values for the day class are greater
than 16.25 and less than 122.25, there are no outliers.


Night class outliers are calculated as:


• Q1 - IQR(1.5) = 78 - 11(1.5) = 61.5
• Q3 + IQR(1.5) = 89 + 11(1.5) = 105.5


For this class, any test score less than 61.5 is an outlier. Therefore, the 
scores of 45 and 25.5 are outliers. Since no test score is greater than 
105.5, there is no upper end outlier. 


Note: 
Try It 
Exercise: 


Problem: 


Find the interquartile range for the following two data sets and compare 
them. 


Test Scores for Class A 

69; 96; 81; 79; 65; 76; 83; 99; 89; 67; 90; 77; 85; 98; 66; 91; 77; 69; 80; 94
Test Scores for Class B

90; 72; 80; 92; 90; 97; 92; 75; 79; 68; 70; 80; 99; 95; 78; 73; 71; 68; 95; 100


Solution: 

Class A 

Order the data from smallest to largest. 

65 66 67 69 69 76 77 77 79 80 81 83 85 89 90 91 94 96 98 99 
Median = (80 + 81) / 2 = 80.5


Q1 = (69 + 76) / 2 = 72.5


Q3 = (90 + 91) / 2 = 90.5


IQR = 90.5 - 72.5 = 18


Class B 


Order the data from smallest to largest. 


68 68 70 71 72 73 75 78 79 80 80 90 90 92 92 95 95 97 99 100


Median = (80 + 80) / 2 = 80


Q1 = (72 + 73) / 2 = 72.5


Q3 = (92 + 95) / 2 = 93.5


IQR = 93.5 - 72.5 = 21


The data for Class B has a larger IQR, so the scores between Q1 and Q3
(the middle 50%) for Class B are more spread out and less clustered
about the median.


Example: 


Fifty statistics students were asked how much sleep they get per school night 
(rounded to the nearest hour). The results were: 


AMOUNT OF SLEEP PER SCHOOL NIGHT (HOURS) FREQUENCY RELATIVE FREQUENCY CUMULATIVE RELATIVE FREQUENCY
4 2 0.04 0.04
5 5 0.10 0.14
6 7 0.14 0.28
7 12 0.24 0.52
8 14 0.28 0.80
9 7 0.14 0.94
10 3 0.06 1.00


Find the 28th percentile. Notice the 0.28 in the "cumulative relative frequency"
column. Twenty-eight percent of 50 data values is 14 values. There are 14 values
less than the 28th percentile. They include the two 4s, the five 5s, and the seven
6s. The 28th percentile is between the last six and the first seven. The 28th
percentile is 6.5.

Find the median. Look again at the "cumulative relative frequency" column and
find 0.52. The median is the 50th percentile or the second quartile. 50% of 50 is
25. There are 25 values less than the median. They include the two 4s, the five
5s, the seven 6s, and eleven of the 7s. The median or 50th percentile is between
the 25th, or seven, and 26th, or seven, values. The median is seven.

Find the third quartile. The third quartile is the same as the 75th percentile.
You can "eyeball" this answer. If you look at the "cumulative relative frequency"
column, you find 0.52 and 0.80. When you have all the fours, fives, sixes and
sevens, you have 52% of the data. When you include all the 8s, you have 80% of
the data. The 75th percentile, then, must be an eight. Another way to look at
the problem is to find 75% of 50, which is 37.5, and round up to 38. The third
quartile, Q3, is the 38th value, which is an eight. You can check this answer by
counting the values. (There are 37 values below the third quartile and 12 values
above.)
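
The counting used in this example can be sketched in Python. The rule coded below is an assumption inferred from the worked answers rather than a formula stated in the text: compute k% of n; if that is a whole number i, average the i-th and (i+1)-th ordered values; otherwise round up and take that value. The function names are hypothetical, and k is assumed to be a whole number strictly between 0 and 100.

```python
def expand(freq_table):
    """Turn {value: frequency} into the full ordered list of data values."""
    values = []
    for value, count in sorted(freq_table.items()):
        values.extend([value] * count)
    return values


def percentile(ordered, k):
    """k-th percentile by the counting rule assumed above (0 < k < 100)."""
    n = len(ordered)
    if (k * n) % 100 == 0:
        i = k * n // 100                  # k% of n is a whole number
        return (ordered[i - 1] + ordered[i]) / 2
    i = (k * n + 99) // 100               # otherwise round up
    return ordered[i - 1]


# Hours of sleep per school night and their frequencies, from the table above.
sleep = expand({4: 2, 5: 5, 6: 7, 7: 12, 8: 14, 9: 7, 10: 3})

print(percentile(sleep, 28))   # 28th percentile
print(percentile(sleep, 50))   # median
print(percentile(sleep, 75))   # third quartile
```

Integer arithmetic is used for k% of n on purpose: in floating point, 0.28 * 50 is not exactly 14, which would send the 28th percentile down the wrong branch.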


Note: 
Try it 
Exercise: 


Problem: 


Forty bus drivers were asked how many hours they spend each day running 
their routes (rounded to the nearest hour). Find the 65th percentile.


Amount of time Cumulative 

spent on route Relative Relative 

(hours) Frequency Frequency Frequency 

2 12 0.30 0.30 

3 14 0.35 0.65 

4 10 0.25 0.90 

5 4 0.10 1.00 
Solution: 


The 65" percentile is between the last three and the first four. 


The 65" percentile is 3.5. 


Example: 
Exercise: 


Problem: Using [link]: 


a. Find the 80th percentile.
b. Find the 90th percentile.
c. Find the first quartile. What is another name for the first quartile? 


Solution: 
Using the data from the frequency table, we have: 


a. The 80th percentile is between the last eight and the first nine in the
table (between the 40th and 41st values). Therefore, we need to take the
mean of the 40th and 41st values. The 80th percentile = $\frac{8 + 9}{2} = 8.5$.

b. The 90th percentile will be the 45th data value (location is 0.90(50) =
45) and the 45th data value is nine.

c. Q1 is also the 25th percentile. The 25th percentile location calculation:
$P_{25} = 0.25(50) = 12.5 \approx 13$, the 13th data value. Thus, the 25th
percentile is six.
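
The location rule used in these solutions (take k% of n; if the result is not a whole number, round up; if it is a whole number, look between that value and the next one) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `percentile_by_location` is hypothetical, k is assumed to be a whole number from 1 to 99, and whole-number locations are always averaged with the next value. In part b that averaging changes nothing, because the 45th and 46th values are both nine.

```python
def percentile_by_location(counts, k):
    """Percentile from a frequency table, using the location k% of n.

    counts maps each data value to its frequency; k is a whole number, 1 to 99.
    """
    if not 1 <= k <= 99:
        raise ValueError("k must be between 1 and 99")
    # Expand the table into the full ordered list of data values.
    ordered = [value for value in sorted(counts) for _ in range(counts[value])]
    if not ordered:
        raise ValueError("the table is empty")
    n = len(ordered)
    whole, remainder = divmod(k * n, 100)   # location = whole + remainder/100
    if remainder:
        # Not a whole number: round up to position whole + 1 (zero-based index whole).
        return ordered[whole]
    # Whole number: average the values in positions whole and whole + 1.
    return (ordered[whole - 1] + ordered[whole]) / 2

sleep_counts = {4: 2, 5: 5, 6: 7, 7: 12, 8: 14, 9: 7, 10: 3}
bus_counts = {2: 12, 3: 14, 4: 10, 5: 4}

for k in (25, 28, 50, 75, 80, 90):
    print(k, percentile_by_location(sleep_counts, k))
print(65, percentile_by_location(bus_counts, 65))
```

Working with `divmod(k * n, 100)` instead of multiplying by a decimal keeps the whole-number test exact.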


Note: 
Try It 
Exercise: 


Problem: 


Refer to the [link]. Find the third quartile. What is another name for the 
third quartile? 


Solution: 
The third quartile is the 75th percentile, which is four. The 65th percentile is
between three and four, and the 90th percentile is between four and 5.75.
The third quartile is between the 65th and 90th percentiles, so it must be four.


Note: 
Collaborative Statistics 


Your instructor or a member of the class will ask everyone in class how many 
sweaters they own. Answer the following questions: 


1. How many students were surveyed? 

2. What kind of sampling did you do? 

3. Construct two different histograms. For each, starting value = _____ ending
value = _____

4. Find the median, first quartile, and third quartile. 

5. Construct a table of the data to find the following: 


a. the 10th percentile
b. the 70th percentile
c. the percent of students who own less than four sweaters 


A Formula for Finding the kth Percentile 


If you were to do a little research, you would find several formulas for
calculating the kth percentile. Here is one of them.


k = the kth percentile. It may or may not be part of the data.
i = the index (ranking or position of a data value)
n = the total number of data


• Order the data from smallest to largest.

• Calculate $i = \frac{k}{100}(n + 1)$.

• If i is an integer, then the kth percentile is the data value in the ith position in
the ordered set of data.

• If i is not an integer, then round i up and round i down to the nearest
integers. Average the two data values in these two positions in the ordered
data set. This is easier to understand in an example.


Example: 
Exercise: 


Problem: 


Listed are 29 ages for Academy Award winning best actors in order from 
smallest to largest. 

18; 21; 22; 25; 26; 27; 29; 30; 31; 33; 36; 37; 41; 42; 47; 52; 55; 57; 58; 62;
64; 67; 69; 71; 72; 73; 74; 76; 77


a. Find the 70th percentile.
b. Find the 83rd percentile.


Solution: 
a. ◦ k = 70
   ◦ i = the index
   ◦ n = 29


$i = \frac{k}{100}(n + 1) = \frac{70}{100}(29 + 1) = 21$. Twenty-one is an integer, and the
data value in the 21st position in the ordered data set is 64. The 70th
percentile is 64 years.


b. ◦ k = 83
   ◦ i = the index
   ◦ n = 29


$i = \frac{k}{100}(n + 1) = \frac{83}{100}(29 + 1) = 24.9$, which is NOT an integer.
Round it down to 24 and up to 25. The age in the 24th position is 71
and the age in the 25th position is 72. Average 71 and 72. The 83rd
percentile is 71.5 years.
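
The steps of this formula can be sketched in Python as shown below. This is a minimal illustration, not a standard library routine: the function name `kth_percentile` is hypothetical, k is assumed to be a whole number, and the function raises an error when the index i falls outside positions 1 through n, a case the text does not address.

```python
# The 29 ordered ages from the example.
ages = [18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47,
        52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77]

def kth_percentile(ordered, k):
    """kth percentile of already ordered data, using i = (k/100)(n + 1)."""
    n = len(ordered)
    # i = whole + remainder/100, computed with integers to keep the test exact.
    whole, remainder = divmod(k * (n + 1), 100)
    last = whole if remainder == 0 else whole + 1
    if whole < 1 or last > n:
        raise ValueError("index i falls outside positions 1..n")
    if remainder == 0:
        # i is an integer: take the data value in position i.
        return ordered[whole - 1]
    # i is not an integer: average the values in the positions on either side.
    return (ordered[whole - 1] + ordered[whole]) / 2

print(kth_percentile(ages, 70))  # index 21
print(kth_percentile(ages, 83))  # index 24.9, so positions 24 and 25 are averaged
```

Statistical software generally offers several percentile definitions, so results from other tools may differ slightly from this formula.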


Note: 
Try It 
Exercise: 


Problem: 


Listed are 29 ages for Academy Award winning best actors in order from 
smallest to largest. 


18; 21; 22; 25; 26; 27; 29; 30; 31; 33; 36; 37; 41; 42; 47; 52; 55; 57; 58; 62;
64; 67; 69; 71; 72; 73; 74; 76; 77

Calculate the 20th percentile and the 55th percentile.


Solution: 


k = 20. Index = $i = \frac{k}{100}(n + 1) = \frac{20}{100}(29 + 1) = 6$. The age in the sixth
position is 27. The 20th percentile is 27 years.


k = 55. Index = $i = \frac{k}{100}(n + 1) = \frac{55}{100}(29 + 1) = 16.5$. Round down to 16
and up to 17. The age in the 16th position is 52 and the age in the 17th
position is 55. The average of 52 and 55 is 53.5. The 55th percentile is 53.5
years.


Note: 

NOTE 

You can calculate percentiles using calculators and computers. There are a 
variety of online calculators. 


A Formula for Finding the Percentile of a Value in a Data Set 


• Order the data from smallest to largest.

• x = the number of data values counting from the bottom of the data list up to
but not including the data value for which you want to find the percentile.

• y = the number of data values equal to the data value for which you want to
find the percentile.

• n = the total number of data.

• Calculate $\frac{x + 0.5y}{n}(100)$. Then round to the nearest integer.


Example: 
Exercise: 


Problem: 


Listed are 29 ages for Academy Award winning best actors in order from 
smallest to largest. 

18; 21; 22; 25; 26; 27; 29; 30; 31; 33; 36; 37; 41; 42; 47; 52; 55; 57; 58; 62;
64; 67; 69; 71; 72; 73; 74; 76; 77


a. Find the percentile for 58. 
b. Find the percentile for 25. 


Solution: 


a. Counting from the bottom of the list, there are 18 data values less than
58. There is one value of 58.


x = 18 and y = 1. $\frac{x + 0.5y}{n}(100) = \frac{18 + 0.5(1)}{29}(100) = 63.79$. 58 is the 64th
percentile.

b. Counting from the bottom of the list, there are three data values less
than 25. There is one value of 25.


x = 3 and y = 1. $\frac{x + 0.5y}{n}(100) = \frac{3 + 0.5(1)}{29}(100) = 12.07$. Twenty-five is
the 12th percentile.
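
The counting in this solution can be checked with a short Python sketch. The function name `percentile_of_value` is illustrative only. Note also that Python's built-in `round` sends exact halves to the nearest even integer, which is an implementation detail rather than part of the text's method and does not affect these two values.

```python
# The 29 ordered ages from the example.
ages = [18, 21, 22, 25, 26, 27, 29, 30, 31, 33, 36, 37, 41, 42, 47,
        52, 55, 57, 58, 62, 64, 67, 69, 71, 72, 73, 74, 76, 77]

def percentile_of_value(data, value):
    """Percentile of a data value, computed as (x + 0.5y) / n * 100."""
    n = len(data)
    x = sum(1 for v in data if v < value)    # values below the chosen value
    y = sum(1 for v in data if v == value)   # values equal to the chosen value
    return (x + 0.5 * y) / n * 100

for age in (58, 25):
    exact = percentile_of_value(ages, age)
    print(age, round(exact, 2), round(exact))
```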


Note: 
Try It 
Exercise: 


Problem: 


Listed are 30 ages for Academy Award winning best actors in order from 
smallest to largest. 


18; 21; 22; 25; 26; 27; 29; 30; 31; 31; 33; 36; 37; 41; 42; 47; 52; 55; 57; 58;
62; 64; 67; 69; 71; 72; 73; 74; 76; 77


Find the percentiles for 47 and 31. 


Solution: 


Percentile for 47: Counting from the bottom of the list, there are 15 data
values less than 47. There is one value of 47.

x = 15 and y = 1. $\frac{x + 0.5y}{n}(100) = \frac{15 + 0.5(1)}{30}(100) = 51.67$. 47 is the 52nd
percentile.


Percentile for 31: Counting from the bottom of the list, there are eight data
values less than 31. There are two values of 31.

x = 8 and y = 2. $\frac{x + 0.5y}{n}(100) = \frac{8 + 0.5(2)}{30}(100) = 30$. 31 is the 30th
percentile.


Interpreting Percentiles, Quartiles, and Median 


A percentile indicates the relative standing of a data value when data are sorted
into numerical order from smallest to largest. Percentages of data values are less
than or equal to the pth percentile. For example, 15% of data values are less than
or equal to the 15th percentile.


• Low percentiles always correspond to lower data values.
• High percentiles always correspond to higher data values.


A percentile may or may not correspond to a value judgment about whether it is 
"good" or "bad." The interpretation of whether a certain percentile is "good" or 
"bad" depends on the context of the situation to which the data applies. In some 
situations, a low percentile would be considered "good;" in other contexts a high 
percentile might be considered "good". In many situations, there is no value 
judgment that applies. 


Understanding how to interpret percentiles properly is important not only when 
describing data, but also when calculating probabilities in later chapters of this 
text. 


Note: 

NOTE 

When writing the interpretation of a percentile in the context of the given data, 
the sentence should contain the following information. 


• information about the context of the situation being considered

• the data value (value of the variable) that represents the percentile

• the percent of individuals or items with data values below the percentile

• the percent of individuals or items with data values above the percentile.


Example: 
Exercise: 


Problem: 


On a timed math test, the first quartile for time it took to finish the exam 
was 35 minutes. Interpret the first quartile in the context of this situation. 


Solution: 


• Twenty-five percent of students finished the exam in 35 minutes or
less.

• Seventy-five percent of students finished the exam in 35 minutes or
more.

• A low percentile could be considered good, as finishing more quickly
on a timed exam is desirable. (If you take too long, you might not be
able to finish.)


Note: 
Try It 
Exercise: 


Problem: 


For the 100-meter dash, the third quartile for times for finishing the race 
was 11.5 seconds. Interpret the third quartile in the context of the situation. 


Solution: 


Twenty-five percent of runners finished the race in 11.5 seconds or more. 
Seventy-five percent of runners finished the race in 11.5 seconds or less. A 
lower percentile is good because finishing a race more quickly is desirable. 


Example: 
Exercise: 


Problem: 


On a 20 question math test, the 70th percentile for number of correct
answers was 16. Interpret the 70th percentile in the context of this situation.


Solution:


• Seventy percent of students answered 16 or fewer questions correctly.

• Thirty percent of students answered 16 or more questions correctly.

• A higher percentile could be considered good, as answering more
questions correctly is desirable.


Note: 
Try It 
Exercise: 


Problem: 
On a 60 point written assignment, the 80th percentile for the number of
points earned was 49. Interpret the 80th percentile in the context of this
situation.


Solution: 


Eighty percent of students earned 49 points or fewer. Twenty percent of 
students earned 49 or more points. A higher percentile is good because 
getting more points on an assignment is desirable. 


Example: 
Exercise: 


Problem: 


At a community college, it was found that the 30th percentile of credit units
that students are enrolled for is seven units. Interpret the 30th percentile in
the context of this situation.


Solution:


• Thirty percent of students are enrolled in seven or fewer credit units.

• Seventy percent of students are enrolled in seven or more credit units.

• In this example, there is no "good" or "bad" value judgment associated
with a higher or lower percentile. Students attend community college
for varied reasons and needs, and their course load varies according to
their needs.


Note: 
Try It 
Exercise: 


Problem: 


During a season, the 40th percentile for points scored per player in a game
is eight. Interpret the 40th percentile in the context of this situation.


Solution: 


Forty percent of players scored eight points or fewer. Sixty percent of 
players scored eight points or more. A higher percentile is good because 
getting more points in a basketball game is desirable. 


Example: 

Sharpe Middle School is applying for a grant that will be used to add fitness 
equipment to the gym. The principal surveyed 15 anonymous students to 
determine how many minutes a day the students spend exercising. The results 
from the 15 anonymous students are shown. 

0 minutes; 40 minutes; 60 minutes; 30 minutes; 60 minutes 

10 minutes; 45 minutes; 30 minutes; 300 minutes; 90 minutes; 

30 minutes; 120 minutes; 60 minutes; 0 minutes; 20 minutes 

Determine the following five values. 


• Min = 0

• Q1 = 20

• Med = 40

• Q3 = 60

• Max = 300


If you were the principal, would you be justified in purchasing new fitness 
equipment? Since 75% of the students exercise for 60 minutes or less daily, and 
since the IQR is 40 minutes (60 − 20 = 40), we know that half of the students
surveyed exercise between 20 minutes and 60 minutes daily. This seems a 
reasonable amount of time spent exercising, so the principal would be justified 
in purchasing the new equipment. 

However, the principal needs to be careful. The value 300 appears to be a 
potential outlier. 

Q3 + 1.5(IQR) = 60 + (1.5)(40) = 120.

The value 300 is greater than 120 so it is a potential outlier. If we delete it and 
calculate the five values, we get the following values: 


• Min = 0
• Q1 = 20
• Med = 35
• Q3 = 60
• Max = 120


We still have 75% of the students exercising for 60 minutes or less daily and half 
of the students exercising between 20 and 60 minutes a day. However, 15 
students is a small sample and the principal should survey more students to be 
sure of his survey results. 
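
The five values and the outlier check in this example can be reproduced with a short Python sketch. It assumes each quartile is the median of the lower or upper half of the ordered data, leaving out the middle value when the count is odd; this rule matches the values above, though other quartile conventions exist. The function names are illustrative, and the lower bound Q1 − 1.5(IQR) is included alongside the upper bound used in the example.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max); quartiles are medians of the two halves."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # values below the median position
    upper = ordered[half + n % 2:]    # values above the median position
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

def potential_outliers(data):
    """Values below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR)."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]

# Minutes of daily exercise reported by the 15 surveyed students.
minutes = [0, 40, 60, 30, 60, 10, 45, 30, 300, 90, 30, 120, 60, 0, 20]

summary = five_number_summary(minutes)
flagged = potential_outliers(minutes)
trimmed = five_number_summary([x for x in minutes if x not in flagged])
print(summary)   # five values for all 15 students
print(flagged)   # potential outliers
print(trimmed)   # five values after removing the potential outliers
```

The value 120 sits exactly on the upper bound, so it is not flagged; only values strictly beyond the bounds are treated as potential outliers here.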
Chapter Review


The values that divide a rank-ordered set of data into 100 equal parts are called
percentiles. Percentiles are used to compare and interpret data. For example, an
observation at the 50th percentile would be greater than 50 percent of the other
observations in the set. Quartiles divide data into quarters. The first quartile (Q1)
is the 25th percentile, the second quartile (Q2 or median) is the 50th percentile, and the
third quartile (Q3) is the 75th percentile. The interquartile range, or IQR, is the
range of the middle 50 percent of the data values. The IQR is found by
subtracting Q1 from Q3, and can help determine outliers by using the following
two expressions.


• Q3 + IQR(1.5)
• Q1 − IQR(1.5)


Formula Review 

$i = \frac{k}{100}(n + 1)$

where i = the ranking or position of a data value,
k = the kth percentile,
n = total number of data.

Expression for finding the percentile of a data value: $\frac{x + 0.5y}{n}(100)$

where x = the number of values counting from the bottom of the data list up to
but not including the data value for which you want to find the percentile,
y = the number of data values equal to the data value for which you want to find
the percentile,
n = total number of data
Exercise: 


Problem: 


Listed are 29 ages for Academy Award winning best actors in order from 
smallest to largest. 


18; 21; 22; 25; 26; 27; 29; 30; 31; 33; 36; 37; 41; 42; 47; 52; 55; 57; 58; 62;
64; 67; 69; 71; 72; 73; 74; 76; 77


a. Find the 40th percentile.
b. Find the 78th percentile.

Solution:


a. The 40th percentile is 37 years.
b. The 78th percentile is 70 years.


Exercise: 


Problem: 


Listed are 32 ages for Academy Award winning best actors in order from 
smallest to largest. 


18; 18; 21; 22; 25; 26; 27; 29; 30; 31; 31; 33; 36; 37; 37; 41; 42; 47; 52; 55;
57; 58; 62; 64; 67; 69; 71; 72; 73; 74; 76; 77


a. Find the percentile of 37. 
b. Find the percentile of 72. 


Exercise: 


Problem: 


Jesse was ranked 37th in his graduating class of 180 students. At what
percentile is Jesse's ranking?


Solution:


Jesse graduated 37th out of a class of 180 students. There are 180 − 37 = 143
students ranked below Jesse. There is one rank of 37.

x = 143 and y = 1. $\frac{x + 0.5y}{n}(100) = \frac{143 + 0.5(1)}{180}(100) = 79.72$. Jesse's rank of 37
puts him at the 80th percentile.


Exercise: 


Problem: 


a. For runners in a race, a low time means a faster run. The winners in a 
race have the shortest running times. Is it more desirable to have a 
finish time with a high or a low percentile when running a race? 

b. The 20th percentile of run times in a particular race is 5.2 minutes.
Write a sentence interpreting the 20th percentile in the context of the
situation.

c. A bicyclist in the 90th percentile of a bicycle race completed the race in
1 hour and 12 minutes. Is he among the fastest or slowest cyclists in the
race? Write a sentence interpreting the 90th percentile in the context of
the situation.


Exercise: 


Problem: 


a. For runners in a race, a higher speed means a faster run. Is it more
desirable to have a speed with a high or a low percentile when running
a race?

b. The 40th percentile of speeds in a particular race is 7.5 miles per hour.
Write a sentence interpreting the 40th percentile in the context of the
situation.


Solution: 


a. For runners in a race it is more desirable to have a high percentile for 
speed. A high percentile means a higher speed which is faster. 

b. 40% of runners ran at speeds of 7.5 miles per hour or less (slower). 
60% of runners ran at speeds of 7.5 miles per hour or more (faster). 


Exercise: 


Problem: 


On an exam, would it be more desirable to earn a grade with a high or low 
percentile? Explain. 


Exercise: 


Problem: 


Mina is waiting in line at the Department of Motor Vehicles (DMV). Her
wait time of 32 minutes is the 85th percentile of wait times. Is that good or
bad? Write a sentence interpreting the 85th percentile in the context of this
situation.


Solution: 


When waiting in line at the DMV, the 85th percentile would be a long wait
time compared to the other people waiting. 85% of people had shorter wait
times than Mina. In this context, Mina would prefer a wait time
corresponding to a lower percentile. 85% of people at the DMV waited 32
minutes or less. 15% of people at the DMV waited 32 minutes or longer.


Exercise: 


Problem: 


In a survey collecting data about the salaries earned by recent college
graduates, Li found that her salary was in the 78th percentile. Should Li be
pleased or upset by this result? Explain.


Exercise: 


Problem: 


In a study collecting data about the repair costs of damage to automobiles in
a certain type of crash test, a certain model of car had $1,700 in damage
and was in the 90th percentile. Should the manufacturer and the consumer be
pleased or upset by this result? Explain and write a sentence that interprets
the 90th percentile in the context of this problem.


Solution: 


The manufacturer and the consumer would be upset. This is a large repair 
cost for the damages, compared to the other cars in the sample. 
INTERPRETATION: 90% of the crash tested cars had damage repair costs 
of $1700 or less; only 10% had damage repair costs of $1700 or more. 


Exercise: 


Problem: 


The University of California has two criteria used to set admission standards 
for freshmen to be admitted to a college in the UC system:


a. Students' GPAs and scores on standardized tests (SATs and ACTs) are 
entered into a formula that calculates an "admissions index" score. The 
admissions index score is used to set eligibility standards intended to 
meet the goal of admitting the top 12% of high school students in the 
state. In this context, what percentile does the top 12% represent? 

b. Students whose GPAs are at or above the 96th percentile of all students
at their high school are eligible (called eligible in the local context), 
even if they are not in the top 12% of all students in the state. What 
percentage of students from each high school are "eligible in the local 
context"? 


Exercise: 
Problem: 
Suppose that you are buying a house. You and your realtor have determined
that the most expensive house you can afford is the 34th percentile. The 34th
percentile of housing prices is $240,000 in the town you want to move to. In
this town, can you afford 34% of the houses or 66% of the houses?


Solution: 


You can afford 34% of houses. 66% of the houses are too expensive for your 
budget. INTERPRETATION: 34% of houses cost $240,000 or less. 66% of 
houses cost $240,000 or more. 


Use the following information to answer the next six exercises. Sixty-five 
randomly selected car salespersons were asked the number of cars they generally 
sell in one week. Fourteen people answered that they generally sell three cars; 
nineteen generally sell four cars; twelve generally sell five cars; nine generally 
sell six cars; eleven generally sell seven cars. 

Exercise: 


Problem: First quartile = 


Exercise: 


Problem: Second quartile = median = 50th percentile = 


Solution: 


4 


Exercise: 


Problem: Third quartile = 


Exercise: 


Problem: Interquartile range (IQR) = _____ − _____ = _____


Solution: 


6-4=2 


Exercise: 


Problem: 10th percentile = 


Exercise: 


Problem: 70th percentile = 


Solution: 


6 


Homework 


Exercise: 


Problem: 


The median age for U.S. blacks currently is 30.9 years; for U.S. whites it is 
42.3 years. 


a. Based upon this information, give two reasons why the black median 
age could be lower than the white median age. 

b. Does the lower median age for blacks necessarily mean that blacks die 
younger than whites? Why or why not? 

c. How might it be possible for blacks and whites to die at approximately 
the same age, but for the median age for whites to be higher? 


Exercise: 
Problem: 
Six hundred adult Americans were asked by telephone poll, "What do you 


think constitutes a middle-class income?" The results are in [link]. Also, 
include left endpoint, but not the right endpoint. 


Salary ($)          Relative Frequency


< 20,000            0.02
20,000–25,000       0.09
25,000–30,000       0.19
30,000–40,000       0.26
40,000–50,000       0.18
50,000–75,000       0.17
75,000–99,999       0.02
100,000+            0.01


a. What percentage of the survey answered "not sure"? 
b. What percentage think that middle-class is from $25,000 to $50,000? 
c. Construct a histogram of the data. 


i. Should all bars have the same width, based on the data? Why or 
why not? 

ii. How should the <20,000 and the 100,000+ intervals be handled? 
Why? 


d. Find the 40th and 80th percentiles
e. Construct a bar graph of the data


Solution: 


a. 1 — (0.02+0.09+0.19+0.26+0.18+0.17+0.02+0.01) = 0.06 
b. 0.19+0.26+0.18 = 0.63 
c. Check student’s solution. 


d. The 40th percentile will fall between 30,000 and 40,000.

   The 80th percentile will fall between 50,000 and 75,000.


e. Check student’s solution. 


Exercise: 


Problem: Given the following box plot: 


[Box plot on a number line with marks at 0, 2, 10, 12, and 13.]


a. Which quarter has the smallest spread of data? What is that spread?
b. Which quarter has the largest spread of data? What is that spread?
c. Find the interquartile range (IQR).
d. Are there more data in the interval 5–10 or in the interval 10–13? How
do you know this?
e. Which interval has the fewest data in it? How do you know this?
i. 0–2

ii. 2–4

iii. 10–12

iv. 12–13

v. need more information


Exercise: 


Problem: 


The following box plot shows the U.S. population for 1990, the latest 
available year. 


[Box plot on a number line with marks at 0, 17, 33, 50, and 105.]

a. Are there fewer or more children (age 17 and under) than senior 
citizens (age 65 and over)? How do you know? 


b. 12.6% are age 65 and over. Approximately what percentage of the 
population are working age adults (above age 17 to age 65)? 


Solution: 


a. more children; the left whisker shows that 25% of the population are 
children 17 and younger. The right whisker shows that 25% of the 
population are adults 50 and older, so adults 65 and over represent less 
than 25%. 

b. 62.4% 


Glossary 


Interquartile Range 
or IQR, is the range of the middle 50 percent of the data values; the IQR is 
found by subtracting the first quartile from the third quartile. 


Outlier 
an observation that does not fit the rest of the data 


Percentile 
a number that divides ordered data into hundredths; percentiles may or may
not be part of the data. The median of the data is the second quartile and the
50th percentile. The first and third quartiles are the 25th and the 75th
percentiles, respectively.


Quartiles 
the numbers that separate the data into quarters; quartiles may or may not be 
part of the data. The second quartile is the median of the data. 


Box Plots 


Box plots (also called box-and-whisker plots or box-whisker plots) give a 
good graphical image of the concentration of the data. They also show how 
far the extreme values are from most of the data. A box plot is constructed 
from five values: the minimum value, the first quartile, the median, the third 
quartile, and the maximum value. We use these values to compare how 
close other data values are to them. 


To construct a box plot, use a horizontal or vertical number line and a 
rectangular box. The smallest and largest data values label the endpoints of 
the axis. The first quartile marks one end of the box and the third quartile 
marks the other end of the box. Approximately the middle 50 percent of 
the data fall inside the box. The "whiskers" extend from the ends of the 
box to the smallest and largest data values. The median or second quartile 
can be between the first and third quartiles, or it can be one, or the other, or 
both. The box plot gives a good, quick picture of the data. 


Note: 

NOTE 

You may encounter box-and-whisker plots that have dots marking outlier 
values. In those cases, the whiskers are not extending to the minimum and 
maximum values. 


Consider, again, this dataset. 
1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5


The first quartile is two, the median is seven, and the third quartile is nine. 
The smallest value is one, and the largest value is 11.5. The following 
image shows the constructed box plot. 


Note: 
NOTE 
See the calculator instructions on the TI web site or in the appendix.


[Box plot on a number line from 1 to 11.5: the box runs from 2 to 9 with the
median marked at 7, and the whiskers extend to 1 and 11.5.]


The two whiskers extend from the first quartile to the smallest value and 
from the third quartile to the largest value. The median is shown with a 
dashed line. 


Note: 

NOTE 

It is important to start a box plot with a scaled number line. Otherwise the 
box plot may not be useful. 


Example: 

The following data are the heights of 40 students in a statistics class. 

59 60 61 62 62 63 63 64 64 64 65 65 65 65 65 65 65 65 65 66 66 67 67 68 
68 69 70 70 70 70 70 71 71 72 72 73 74 74 75 77

Construct a box plot with the following properties; the calculator
instructions for the minimum and maximum values as well as the quartiles
follow the example.


• Minimum value = 59

• Maximum value = 77

• Q1: First quartile = 64.5

• Q2: Second quartile or median = 66

• Q3: Third quartile = 70


[Box plot on a number line with marks at 59, 64.5, 66, 70, and 77.]


a. Each quarter has approximately 25% of the data. 

b. The spreads of the four quarters are 64.5 − 59 = 5.5 (first quarter), 66
− 64.5 = 1.5 (second quarter), 70 − 66 = 4 (third quarter), and 77 − 70
= 7 (fourth quarter). So, the second quarter has the smallest spread
and the fourth quarter has the largest spread.

c. Range = maximum value − the minimum value = 77 − 59 = 18

d. Interquartile Range: IQR = Q3 − Q1 = 70 − 64.5 = 5.5.

e. The interval 59–65 has more than 25% of the data so it has more data
in it than the interval 66 through 70 which has 25% of the data.

f. The middle 50% (middle half) of the data has a range of 5.5 inches. 
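
The spreads, range, and IQR in parts b through d can be checked with a short Python sketch. It assumes the quartiles are the medians of the lower and upper halves of the ordered data, which reproduces the values given for this example; the function name `quarter_spreads` is illustrative only.

```python
from statistics import median

# Heights of the 40 students, already in order.
heights = [59, 60, 61, 62, 62, 63, 63, 64, 64, 64, 65, 65, 65, 65, 65, 65, 65, 65, 65, 66,
           66, 67, 67, 68, 68, 69, 70, 70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75, 77]

def quarter_spreads(data):
    """Widths of the four quarters between min, Q1, median, Q3, and max."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])            # median of the lower half
    q3 = median(ordered[half + n % 2:])    # median of the upper half
    cuts = [ordered[0], q1, median(ordered), q3, ordered[-1]]
    return [right - left for left, right in zip(cuts, cuts[1:])]

spreads = quarter_spreads(heights)
print(spreads)             # spread of each quarter, first to fourth
print(sum(spreads[1:3]))   # IQR: the two middle quarters together
print(sum(spreads))        # range: all four quarters together
```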




Note: 

To find the minimum, maximum, and quartiles: 

Enter data into the list editor (Press STAT 1:EDIT). If you need to clear the
list, arrow up to the name L1, press CLEAR, and then arrow down. 
Put the data values into the list L1. 

Press STAT and arrow to CALC. Press 1:1-VarStats. Enter L1. 
Press ENTER. 

Use the down and up arrow keys to scroll. 

Smallest value = 59. 

Largest value = 77. 

Q1: First quartile = 64.5.

Q2: Second quartile or median = 66.

Q3: Third quartile = 70.


To construct the box plot: 

Press 4:Plotsoff. Press ENTER. 

Arrow down and then use the right arrow key to go to the fifth picture, 
which is the box plot. Press ENTER. 

Arrow down to Xlist: Press 2nd 1 for L1 


Arrow down to Freq: Press ALPHA. Press 1. 
Press Zoom. Press 9: ZoomStat. 
Press TRACE, and use the arrow keys to examine the box plot. 


Note: 
Try It 
Exercise: 


Problem: 


The following data are the number of pages in 40 books on a shelf. 
Construct a box plot using a graphing calculator, and state the 
interquartile range. 


136 140 178 190 205 215 217 218 232 234 240 255 270 275 290 301 
303 315 317 318 326 333 343 349 360 369 377 388 391 392 398 400 
402 405 408 422 429 450 475 512 


Solution: 


[Box plot of the page counts on a number line from 120 to 540, marked in steps
of 20.]


IQR = 158 


For some sets of data, some of the largest value, smallest value, first 
quartile, median, and third quartile may be the same. For instance, you 
might have a data set in which the median and the third quartile are the 
same. In this case, the diagram would not have a dotted line inside the box 
displaying the median. The right side of the box would display both the 
third quartile and the median. For example, if the smallest value and the 
first quartile were both one, the median and the third quartile were both 
five, and the largest value was seven, the box plot would look like: 


[Box plot on a number line from 1 to 7: the box runs from 1 to 5, and a whisker
extends from 5 to 7.]


In this case, at least 25% of the values are equal to one. Twenty-five percent 
of the values are between one and five, inclusive. At least 25% of the values 
are equal to five. The top 25% of the values fall between five and seven, 
inclusive. 


Example: 

Test scores for a college statistics class held during the day are: 

99 56 78 55.5 32 90 80 81 56 59 45 77 84.5 84 70 72 68 32 79 90 

Test scores for a college statistics class held during the evening are: 

98 78 68 83 81 89 88 76 65 45 98 90 80 84.5 85 79 78 98 90 79 81 25.5 
Exercise: 


Problem: 


a. Find the smallest and largest values, the median, and the first and 
third quartile for the day class. 

b. Find the smallest and largest values, the median, and the first and 
third quartile for the night class. 

c. For each data set, what percentage of the data is between the 
smallest value and the first quartile? the first quartile and the 
median? the median and the third quartile? the third quartile and 
the largest value? What percentage of the data is between the 
first quartile and the largest value? 

d. Create a box plot for each set of data. Use one number line for 
both box plots. 

e. Which box plot has the widest spread for the middle 50% of the 
data (the data between the first and third quartiles)? What does 
this mean for that set of data in comparison to the other set of 
data? 


Solution: 


a. ◦ Min = 32
   ◦ Q1 = 56
   ◦ M = 74.5
   ◦ Q3 = 82.5
   ◦ Max = 99

b. ◦ Min = 25.5
   ◦ Q1 = 78
   ◦ M = 81
   ◦ Q3 = 89
   ◦ Max = 98


c. Day class: There are six data values ranging from 32 to 56: 30%.
There are six data values ranging from 56 to 74.5: 30%. There
are five data values ranging from 74.5 to 82.5: 25%. There are
five data values ranging from 82.5 to 99: 25%. There are 16 data
values between the first quartile, 56, and the largest value, 99:
75%. Night class:

d. [Box plots for the day class and the night class on one number line from
20 to 100.]

e. The first data set has the wider spread for the middle 50% of the
data. The IQR for the first data set is greater than the IQR for the
second set. This means that there is more variability in the
middle 50% of the first data set.


Note:
Try It


Exercise: 


Problem: 


The following data set shows the heights in inches for the boys in a
class of 40 students.


66; 66; 67; 67; 68; 68; 68; 68; 68; 69; 69; 69; 70; 71; 72; 72; 72; 73;
73; 74

The following data set shows the heights in inches for the girls in a
class of 40 students.

61; 61; 62; 62; 63; 63; 63; 65; 65; 65; 66; 66; 66; 67; 68; 68; 68; 69;
69; 69

Construct a box plot using a graphing calculator for each data set, and 
state which box plot has the wider spread for the middle 50% of the 
data. 


Solution: 
[Box plots of the heights of boys and the heights of girls on one number line
from 60 to 76.]


IQR for the boys = 4 
IQR for the girls = 5 


The box plot for the heights of the girls has the wider spread for the 
middle 50% of the data. 


Example: 

Graph a box-and-whisker plot for the data values shown. 
10; 10; 10; 15; 35; 75; 90; 95; 100; 175; 420; 490; 515; 515; 790

The five numbers used to create a box-and-whisker plot are:


• Min: 10
• Q1: 15
• Med: 95
• Q3: 490
• Max: 790


The following graph shows the box-and-whisker plot.


[Box plot on a number line with marks at 10, 15, 95, 490, and 790.]


Note: 
Try It 
Exercise: 


Problem: 


Follow the steps you used to graph a box-and-whisker plot for the 
data values shown. 


0; 5; 5; 15; 30; 30; 45; 50; 50; 60; 75; 110; 140; 240; 330

Solution: 


The data are in order from least to greatest. There are 15 values, so the 
eighth number in order is the median: 50. There are seven data values 
written to the left of the median and 7 values to the right. The five 
values that are used to create the boxplot are: 


• Min: 0
• Q1: 15
• Med: 50
• Q3: 110
• Max: 330


Chapter Review


Box plots are a type of graph that can help visually organize data. To graph 
a box plot the following data points must be calculated: the minimum value, 
the first quartile, the median, the third quartile, and the maximum value. 
Once the box plot is graphed, you can display and compare distributions of 
data. 


Use the following information to answer the next two exercises. Sixty-five 
randomly selected car salespersons were asked the number of cars they 
generally sell in one week. Fourteen people answered that they generally 
sell three cars; nineteen generally sell four cars; twelve generally sell five 
cars; nine generally sell six cars; eleven generally sell seven cars. 
Exercise: 


Problem: 
Construct a box plot below. Use a ruler to measure and scale 
accurately. 
Exercise: 
Problem: 
Looking at your box plot, does it appear that the data are concentrated 


together, spread out evenly, or concentrated in some areas, but not in 
others? How can you tell? 


Solution: 


More than 25% of salespersons sell four cars in a typical week. You 
can see this concentration in the box plot because the first quartile is 
equal to the median. The top 25% and the bottom 25% are spread out 
evenly; the whiskers have the same length. 


Homework 


Exercise: 


Problem: 


In a survey of 20-year-olds in China, Germany, and the United States, 
people were asked the number of foreign countries they had visited in 


their lifetime. The following box plots display the results. 
[Figure: three box plots, labeled China, Germany, and United States]


a. In complete sentences, describe what the shape of each box plot 
implies about the distribution of the data collected. 

b. Have more Americans or more Germans surveyed been to over 
eight foreign countries? 

c. Compare the three box plots. What do they imply about the 
foreign travel of 20-year-old residents of the three countries when 
compared to each other? 


Exercise: 


Problem: Given the following box plot, answer the questions. 


a. Think of an example (in words) where the data might fit into the 
above box plot. In 2—5 sentences, write down the example. 

b. What does it mean to have the first and second quartiles so close 
together, while the second to third quartiles are far apart? 


Solution: 


a. Answers will vary. Possible answer: State University conducted a 
survey to see how involved its students are in community service. 
The box plot shows the number of community service hours 
logged by participants over the past year. 

b. Because the first and second quartiles are close, the data in this 
quarter is very similar. There is not much variation in the values. 
The data in the third quarter is much more variable, or spread out. 
This is clear because the second quartile is so far away from the 
third quartile. 


Exercise: 


Problem: Given the following box plots, answer the questions. 
[Figure: two box plots, labeled Data 1 and Data 2]


a. In complete sentences, explain why each statement is false. 


i. Data 1 has more data values above two than Data 2 has 
above two. 
ii. The data sets cannot have the same mode. 
iii. For Data 1, there are more data values below four than there 
are above four. 


b. For which group, Data 1 or Data 2, is the value of “7” more likely 
to be an outlier? Explain why in complete sentences. 


Exercise: 


Problem: 


A survey was conducted of 130 purchasers of new BMW 3 series cars, 
130 purchasers of new BMW 5 series cars, and 130 purchasers of new 
BMW 7 series cars. In it, people were asked the age they were when 
they purchased their car. The following box plots display the results. 


[Figure: three box plots, labeled BMW 3 series, BMW 5 series, and BMW 7 series]


a. In complete sentences, describe what the shape of each box plot 
implies about the distribution of the data collected for that car 
series. 

b. Which group is most likely to have an outlier? Explain how you 
determined that. 

c. Compare the three box plots. What do they imply about the age of 
purchasing a BMW from the series when compared to each other? 

d. Look at the BMW 5 series. Which quarter has the smallest spread 
of data? What is the spread? 


e. Look at the BMW 5 series. Which quarter has the largest spread 
of data? What is the spread? 
f. Look at the BMW 5 series. Estimate the interquartile range
(IQR).
g. Look at the BMW 5 series. Are there more data in the interval 31
to 38 or in the interval 45 to 55? How do you know this?
h. Look at the BMW 5 series. Which interval has the fewest data in
it? How do you know this?

i. 31-35
ii. 38-41
iii. 41-64


Solution: 


a. Each box plot is spread out more in the greater values. Each plot 
is skewed to the right, so the ages of the top 50% of buyers are 
more variable than the ages of the lower 50%. 

b. The BMW 3 series is most likely to have an outlier. It has the 
longest whisker. 

c. Comparing the median ages, younger people tend to buy the 
BMW 3 series, while older people tend to buy the BMW 7 series. 
However, this is not a rule, because there is so much variability in 
each data set. 

d. The second quarter has the smallest spread. There seems to be 
only a three-year difference between the first quartile and the 
median. 

e. The third quarter has the largest spread. There seems to be 
approximately a 14-year difference between the median and the 
third quartile. 

f. IQR ≈ 17 years

g. There is not enough information to tell. Each interval lies within a 
quarter, so we cannot tell exactly where the data in that quarter is 
concentrated. 

h. The interval from 31 to 35 years has the fewest data values.
Twenty-five percent of the values fall in the interval 38 to 41, and
25% fall between 41 and 64. Since 25% of values fall between 31
and 38, we know that fewer than 25% fall between 31 and 35.


Exercise: 
Problem: 


Twenty-five randomly selected students were asked the number of 
movies they watched the previous week. The results are as follows: 


# of movies Frequency 
0 5 
1 9 
2 6 
3 4 
4 1 


Construct a box plot of the data. 


Bringing It Together 


Exercise: 


Problem: 


Santa Clara County, CA, has approximately 27,873 Japanese- 
Americans. Their ages are as follows: 


Age Group Percent of Community 


0-17 18.9 
18-24 8.0 

25-34 22.8 
35-44 15.0 
45-54 13.1 
55-64 11.9 
65+ 10.3 


a. Construct a histogram of the Japanese-American community in 
Santa Clara County, CA. The bars will not be the same width for 
this example. Why not? What impact does this have on the 
reliability of the graph? 

b. What percentage of the community is under age 35? 

c. Which box plot most resembles the information above? 


[Figure: candidate box plots, including graph (a), drawn on an age axis from 0 to 100]


Solution: 


a. For graph, check student's solution. 

b. 49.7% of the community is under the age of 35. 

c. Based on the information in the table, graph (a) most closely 
represents the data. 


Glossary 


Box plot 
a graph that gives a quick picture of the middle 50% of the data 


First Quartile 
the value that is the median of the lower half of the ordered data
set


Frequency Polygon 
looks like a line graph but uses intervals to display ranges of large 
amounts of data 


Interval 
also called a class interval; an interval represents a range of data and is 
used when displaying large data sets 


Paired Data Set 
two data sets that have a one to one relationship so that: 


e both data sets are the same size, and 
e each data point in one data set is matched with exactly one point 
from the other set. 


Skewed 
used to describe data that is not symmetrical; when the right side of a 
graph looks “chopped off” compared to the left side, we say it is “skewed
to the left.” When the left side of the graph looks “chopped off” 
compared to the right side, we say the data is “skewed to the right.” 
Alternatively: when the lower values of the data are more spread out, 
we say the data are skewed to the left. When the greater values are 
more spread out, the data are skewed to the right. 


Measures of the Center of the Data 


The "center" of a data set is also a way of describing location. The two most widely used measures of the 
"center" of the data are the mean (average) and the median. To calculate the mean weight of 50 people, 
add the 50 weights together and divide by 50. To find the median weight of the 50 people, order the data 
and find the number that splits the data into two equal parts. The median is generally a better measure of 
the center when there are extreme values or outliers because it is not affected by the precise numerical 
values of the outliers. The mean is the most common measure of the center. 


Note: 

NOTE 

The words “mean” and “average” are often used interchangeably. The substitution of one word for the 
other is common practice. The technical term is “arithmetic mean” and “average” is technically a center 
location. However, in practice among non-statisticians, “average” is commonly accepted for “arithmetic 
mean.” 


When each value in the data set is not unique, the mean can be calculated by multiplying each distinct 
value by its frequency and then dividing the sum by the total number of data values. The letter used to 
represent the sample mean is an x with a bar over it (pronounced “x bar”): $\bar{x}$.


The Greek letter $\mu$ (pronounced "mew") represents the population mean. One of the requirements for the
sample mean to be a good estimate of the population mean is for the sample taken to be truly random. 


To see that both ways of calculating the mean are the same, consider the sample: 
1; 1; 1; 2; 2; 3; 4; 4; 4; 4; 4 
Equation:

$\bar{x} = \frac{1 + 1 + 1 + 2 + 2 + 3 + 4 + 4 + 4 + 4 + 4}{11} = 2.7$

Equation:

$\bar{x} = \frac{3(1) + 2(2) + 1(3) + 5(4)}{11} = 2.7$


In the second calculation, the frequencies are 3, 2, 1, and 5. 
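The agreement between the two calculations can also be checked in code. This is a small Python sketch, not part of the original text; the variable names are illustrative only.

```python
from collections import Counter

sample = [1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4]

# Method 1: add every value and divide by the number of values.
mean_direct = sum(sample) / len(sample)

# Method 2: multiply each distinct value by its frequency, add, then divide.
frequencies = Counter(sample)  # counts: 1 -> 3, 2 -> 2, 3 -> 1, 4 -> 5
mean_by_frequency = sum(value * count for value, count in frequencies.items()) / len(sample)

print(round(mean_direct, 1), round(mean_by_frequency, 1))  # 2.7 2.7
```

Both methods add up the same total, 30, so they must give the same mean.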


You can quickly find the location of the median by using the expression $\frac{n+1}{2}$.


The letter n is the total number of data values in the sample. If n is an odd number, the median is the
middle value of the ordered data (ordered smallest to largest). If n is an even number, the median is equal
to the two middle values added together and divided by two after the data has been ordered. For example,
if the total number of data values is 97, then $\frac{n+1}{2} = \frac{97+1}{2} = 49$. The median is the 49th value in the
ordered data. If the total number of data values is 100, then $\frac{n+1}{2} = \frac{100+1}{2} = 50.5$. The median occurs
midway between the 50th and 51st values. The location of the median and the value of the median are not
the same. The upper case letter M is often used to represent the median. The next example illustrates the
location of the median and the value of the median.
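The difference between the location and the value of the median can be sketched in Python. The function names `median_location` and `median_value` are hypothetical helpers written for this illustration.

```python
def median_location(n):
    """Position of the median in the ordered data, counting from 1."""
    return (n + 1) / 2

def median_value(values):
    """Value of the median, found from its location."""
    data = sorted(values)
    location = median_location(len(data))
    if location.is_integer():
        # Odd n: the median is the value at that position.
        return data[int(location) - 1]
    # Even n: average the two values on either side of the location.
    low = data[int(location) - 1]
    high = data[int(location)]
    return (low + high) / 2

print(median_location(97))   # 49.0
print(median_location(100))  # 50.5
```

A location of 50.5 is not a data value; it says only that the median lies midway between the 50th and 51st ordered values.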


Example: 
Exercise: 


Problem: 


AIDS data indicating the number of months a patient with AIDS lives after taking a new antibody 
drug are as follows (smallest to largest): 

3; 4; 8; 8; 10; 11; 12; 13; 14; 15; 15; 16; 16; 17; 17; 18; 21; 22; 22; 24; 24; 25; 26; 26; 27; 27; 29; 29;
31; 32; 33; 33; 34; 34; 35; 37; 40; 44; 44; 47

Calculate the mean and the median. 


Solution: 


The calculation for the mean is: 


$\bar{x} = \frac{[3 + 4 + (8)(2) + 10 + 11 + 12 + 13 + 14 + (15)(2) + (16)(2) + \dots + 35 + 37 + 40 + (44)(2) + 47]}{40} = 23.6$

To find the median, M, first use the formula for the location. The location is:

$\frac{n+1}{2} = \frac{40+1}{2} = 20.5$

Starting at the smallest value, the median is located between the 20th and 21st values (the two 24s):
3; 4; 8; 8; 10; 11; 12; 13; 14; 15; 15; 16; 16; 17; 17; 18; 21; 22; 22; 24; 24; 25; 26; 26; 27; 27; 29; 29;
31; 32; 33; 33; 34; 34; 35; 37; 40; 44; 44; 47

$M = \frac{24 + 24}{2} = 24$


Note: 

To find the mean and the median: 

Clear list L1. Press STAT 4:ClrList. Enter 2nd 1 for list L1. Press ENTER.

Enter data into the list editor. Press STAT 1:EDIT. 

Put the data values into list L1. 

Press STAT and arrow to CALC. Press 1:1-VarStats. Press 2nd 1 for L1 and then ENTER. 
Press the down and up arrow keys to scroll. 

$\bar{x}$ = 23.6, M = 24
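Software gives the same results as the calculator steps. As one possible illustration, this Python sketch uses the standard `statistics` module on the 40 survival times from this example; the variable name `months` is chosen here for illustration.

```python
from statistics import mean, median

# Months survived, ordered from smallest to largest (40 values).
months = [3, 4, 8, 8, 10, 11, 12, 13, 14, 15, 15, 16, 16, 17, 17, 18, 21, 22,
          22, 24, 24, 25, 26, 26, 27, 27, 29, 29, 31, 32, 33, 33, 34, 34, 35,
          37, 40, 44, 44, 47]

print(len(months))             # 40
print(round(mean(months), 1))  # 23.6
print(median(months))          # 24.0
```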


Note: 
Try It 
Exercise: 


Problem: 


The following data show the number of months patients typically wait on a transplant list before 
getting surgery. The data are ordered from smallest to largest. Calculate the mean and median. 


3; 4; 5; 7; 7; 7; 7; 8; 8; 9; 9; 10; 10; 10; 10; 10; 11; 12; 12; 13; 14; 14; 15; 15; 17; 17; 18; 19; 19; 19; 21; 21; 22; 22; 23;
24; 24; 24; 24


Solution: 


Mean: 3 + 4 + 5 + 7 + 7 + 7 + 7 + 8 + 8 + 9 + 9 + 10 + 10 + 10 + 10 + 10 + 11 + 12 + 12 + 13 + 14
+ 14 + 15 + 15 + 17 + 17 + 18 + 19 + 19 + 19 + 21 + 21 + 22 + 22 + 23 + 24 + 24 + 24 + 24 = 544
$\frac{544}{39} = 13.95$

Median: Starting at the smallest value, the median is the 20th term, which is 13. 


Example: 
Exercise: 


Problem: 


Suppose that in a small town of 50 people, one person earns $5,000,000 per year and the other 49 
each earn $30,000. Which is the better measure of the "center": the mean or the median? 


Solution: 
$\bar{x} = \frac{5{,}000{,}000 + 49(30{,}000)}{50} = 129{,}400$
M = 30,000 


(There are 49 people who earn $30,000 and one person who earns $5,000,000.) 


The median is a better measure of the "center" than the mean because 49 of the values are 30,000 
and one is 5,000,000. The 5,000,000 is an outlier. The 30,000 gives us a better sense of the middle 
of the data. 
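The effect of the single outlier is easy to see numerically. This is a short Python sketch of the town in this example; the variable names are illustrative.

```python
from statistics import mean, median

# One person earns $5,000,000; the other 49 each earn $30,000.
incomes = [5_000_000] + [30_000] * 49

print(f"mean   = {mean(incomes):,.0f}")    # mean   = 129,400
print(f"median = {median(incomes):,.0f}")  # median = 30,000

# Removing the outlier changes the mean a great deal but leaves the median alone.
without_outlier = incomes[1:]
print(f"mean without outlier   = {mean(without_outlier):,.0f}")    # 30,000
print(f"median without outlier = {median(without_outlier):,.0f}")  # 30,000
```

The mean moves from 30,000 to 129,400 because of one value, while the median stays at 30,000.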


Note: 
Try It 
Exercise: 


Problem: 


In a sample of 60 households, one house is worth $2,500,000. Half of the rest are worth $280,000, 
and all the others are worth $315,000. Which is the better measure of the “center”: the mean or the 
median? 


Solution: 


The median is a better measure of the “center” than the mean because 59 of the values are either
$280,000 or $315,000 and only one is $2,500,000. The $2,500,000 is an outlier. Either $280,000 or $315,000 gives us
a better sense of the middle of the data.


Another measure of the center is the mode. The mode is the most frequent value. There can be more than 
one mode in a data set as long as those values have the same frequency and that frequency is the highest. 
A data set with two modes is called bimodal. 


Example: 

Statistics exam scores for 20 students are as follows: 
50; 53; 59; 59; 63; 63; 72; 72; 72; 72; 72; 76; 78; 81; 83; 84; 84; 84; 90; 93
Exercise: 


Problem: Find the mode. 
Solution: 


The most frequent score is 72, which occurs five times. Mode = 72. 


Note: 
Try It 
Exercise: 


Problem: The number of books checked out from the library by 25 students is as follows:


0; 0; 0; 1; 2; 3; 3; 4; 4; 5; 5; 7; 7; 7; 7; 8; 8; 8; 9; 10; 10; 11; 11; 12; 12
Find the mode.


Solution: 


The most frequent number of books is 7, which occurs four times. Mode = 7. 


Example: 

Five real estate exam scores are 430, 430, 480, 480, 495. The data set is bimodal because the scores 430 
and 480 each occur twice. 

When is the mode the best measure of the "center"? Consider a weight loss program that advertises a 
mean weight loss of six pounds the first week of the program. The mode might indicate that most people 
lose two pounds the first week, making the program less appealing. 


Note: 

NOTE 

The mode can be calculated for qualitative data as well as for quantitative data. For example, if the data 
set is: red, red, red, green, green, yellow, purple, black, blue, the mode is red. 
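Finding modes, including for bimodal and qualitative data, can be sketched with Python's standard `statistics.multimode` function (available in Python 3.8 and later), which returns every value tied for the highest frequency. The variable names below are illustrative.

```python
from statistics import multimode

exam_scores = [430, 430, 480, 480, 495]
colors = ["red", "red", "red", "green", "green",
          "yellow", "purple", "black", "blue"]

# Two values tie for the highest frequency, so the data set is bimodal.
print(multimode(exam_scores))  # [430, 480]

# The mode also works for qualitative data.
print(multimode(colors))       # ['red']
```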


Statistical software will easily calculate the mean, the median, and the mode. Some graphing calculators 
can also make these calculations. In the real world, people make these calculations using software. 


Note: 
Try It 


Exercise: 


Problem: 


Five credit scores are 680, 680, 700, 720, 720. The data set is bimodal because the scores 680 and 
720 each occur twice. Consider the annual earnings of workers at a factory. The mode is $25,000 
and occurs 150 times out of 301. The median is $50,000 and the mean is $47,500. What would be 
the best measure of the “center”? 


Solution: 


Because $25,000 occurs nearly half the time, the mode would be the best measure of the center 
because the median and mean don’t represent what most people make at the factory. 


The Law of Large Numbers and the Mean 


The Law of Large Numbers says that if you take samples of larger and larger size from any population, 
then the mean $\bar{x}$ of the sample is very likely to get closer and closer to $\mu$. This is discussed in more detail
later in the text. 
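A small simulation can make this idea concrete. The Python sketch below is an illustration only: it assumes a hypothetical population, the rolls of a fair six-sided die, whose population mean is 3.5, and it fixes the random seed so the run is repeatable.

```python
import random

random.seed(1)  # fixed seed so the run can be repeated

POPULATION_MEAN = 3.5  # mean of a fair six-sided die (assumed population)

def sample_mean_of_die_rolls(n):
    """Roll a simulated fair die n times and return the sample mean."""
    rolls = [random.randint(1, 6) for _ in range(n)]
    return sum(rolls) / n

for n in (10, 1_000, 100_000):
    x_bar = sample_mean_of_die_rolls(n)
    print(n, round(x_bar, 3), "distance from mu:", round(abs(x_bar - POPULATION_MEAN), 3))
```

With larger samples the sample mean tends to land closer to 3.5, although any single small sample may happen to be close by chance.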


Sampling Distributions and Statistic of a Sampling Distribution 


You can think of a sampling distribution as a relative frequency distribution with a great many 
samples. (See Sampling and Data for a review of relative frequency). Suppose thirty randomly selected 
students were asked the number of movies they watched the previous week. The results are in the 
relative frequency table shown below. 


# of movies    Relative Frequency
0              5/30
1              15/30
2              6/30
3              3/30
4              1/30

If you let the number of samples get very large (say, 300 million or more), the relative frequency 
table becomes a relative frequency distribution. 


A statistic is a number calculated from a sample. Examples of statistics include the mean, the median, and the
mode, as well as others. The sample mean $\bar{x}$ is an example of a statistic that estimates the population
mean $\mu$.


Calculating the Mean of Grouped Frequency Tables 


When only grouped data is available, you do not know the individual data values (you know only the intervals
and interval frequencies); therefore, you cannot compute an exact mean for the data set. What we must do
is estimate the actual mean by calculating the mean of a frequency table. A frequency table is a data
representation in which grouped data is displayed along with the corresponding frequencies. To calculate
the mean from a grouped frequency table we can apply the basic definition of mean:
$\text{mean} = \frac{\text{data sum}}{\text{number of data values}}$. We simply need to modify the definition to fit within the restrictions of a frequency
table.


Since we do not know the individual data values we can instead find the midpoint of each interval. The
midpoint is $\frac{\text{lower boundary} + \text{upper boundary}}{2}$. We can now modify the mean definition to be

$\text{Mean of Frequency Table} = \frac{\sum fm}{\sum f}$, where f = the frequency of the interval and m = the midpoint of
the interval.


Example: 
Exercise: 


Problem: 


A frequency table displaying Professor Blount’s last statistics test is shown. Find the best estimate of
the class mean. 


Grade Interval    Number of Students
50-56.5           1
56.5-62.5         0
62.5-68.5         4
68.5-74.5         4
74.5-80.5         2
80.5-86.5         3
86.5-92.5         4
92.5-98.5         1

Solution: 
• Find the midpoints for all intervals.

Grade Interval    Midpoint
50-56.5           53.25
56.5-62.5         59.5
62.5-68.5         65.5
68.5-74.5         71.5
74.5-80.5         77.5
80.5-86.5         83.5
86.5-92.5         89.5
92.5-98.5         95.5

• Calculate the sum of the products of each interval frequency and midpoint, $\sum fm$:

53.25(1) + 59.5(0) + 65.5(4) + 71.5(4) + 77.5(2) + 83.5(3) + 89.5(4) + 95.5(1) = 1460.25

• Divide by the total number of students: $\mu = \frac{\sum fm}{\sum f} = \frac{1460.25}{19} = 76.86$
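The midpoint method above can be sketched in Python. The function name `grouped_mean` and the tuple layout (lower boundary, upper boundary, frequency) are choices made for this illustration, not something defined in the text.

```python
def grouped_mean(intervals):
    """Estimate the mean from (lower boundary, upper boundary, frequency) rows."""
    # Sum of frequency times midpoint over all intervals.
    total_fm = sum(f * (lower + upper) / 2 for lower, upper, f in intervals)
    # Total number of data values.
    total_f = sum(f for _, _, f in intervals)
    return total_fm / total_f

grades = [
    (50, 56.5, 1), (56.5, 62.5, 0), (62.5, 68.5, 4), (68.5, 74.5, 4),
    (74.5, 80.5, 2), (80.5, 86.5, 3), (86.5, 92.5, 4), (92.5, 98.5, 1),
]

print(round(grouped_mean(grades), 2))  # 76.86
```

The result is an estimate, because every value in an interval is treated as if it sat at the midpoint.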


Note: 
Try It 
Exercise: 


Problem: 


Maris conducted a study on the effect that playing video games has on memory recall. As part of 
her study, she compiled the following data: 


Hours Teenagers Spend on Video Games    Number of Teenagers
0-3.5                                   3
3.5-7.5                                 7
7.5-11.5                                12
11.5-15.5                               7
15.5-19.5                               9

What is the best estimate for the mean number of hours spent playing video games? 
Solution: 


Find the midpoint of each interval, multiply by the corresponding number of teenagers, add the
results, and then divide by the total number of teenagers.

The midpoints are 1.75, 5.5, 9.5, 13.5, 17.5.

$\text{Mean} = \frac{(1.75)(3) + (5.5)(7) + (9.5)(12) + (13.5)(7) + (17.5)(9)}{3 + 7 + 12 + 7 + 9} = \frac{409.75}{38} = 10.78$
Chapter Review


The mean and the median can be calculated to help you find the "center" of a data set. The mean is the
most common measure of the center, but the median is generally the better measure when a data set contains
outliers or extreme values. The mode will tell you the most frequently occurring datum (or data) in
your data set. The mean, median, and mode are extremely helpful when you need to analyze your data,
but if your data set consists of ranges which lack specific values, the mean may seem impossible to
calculate. However, the mean can be approximated if you add the lower boundary with the upper
boundary and divide by two to find the midpoint of each interval. Multiply each midpoint by the number
of values found in the corresponding range. Divide the sum of these values by the total number of data
values in the set.


Formula Review 
$\mu = \frac{\sum fm}{\sum f}$, where f = interval frequencies and m = interval midpoints.


Exercise: 


Problem: Find the mean for the following frequency tables. 


a. Grade Frequency 
49.5-59.5 2 
59.5-69.5 3 
69.5-79.5 8 
79.5-89.5 12 
89.5-99.5 5 

b. Daily Low Temperature Frequency
49.5-59.5 53
59.5-69.5 32
69.5-79.5 15
79.5-89.5 1
89.5-99.5 0

c. Points per Game Frequency 
49.5-59.5 14 
59.5-69.5 32 
69.5-79.5 15 
79.5-89.5 23 
89.5-99.5 2 


Use the following information to answer the next three exercises: The following data show the lengths of 
boats moored in a marina. The data are ordered from smallest to largest: 
16; 17; 19; 20; 20; 21; 23; 24; 25; 25; 25; 26; 26; 27; 27; 27; 28; 29; 30; 32; 33; 33; 34; 35; 37; 39; 40

Exercise: 


Problem: Calculate the mean. 


Solution: 


Mean: 16 + 17 + 19 + 20 + 20 + 21 + 23 + 24 + 25 + 25 + 25 + 26 + 26 + 27 + 27 + 27 + 28 + 29 +
30 + 32 + 33 + 33 + 34 + 35 + 37 + 39 + 40 = 738;

$\frac{738}{27} = 27.33$


Exercise: 


Problem: Identify the median. 


Exercise: 


Problem: Identify the mode. 


Solution: 


The most frequent lengths are 25 and 27, which occur three times. Mode = 25, 27 


Use the following information to answer the next three exercises: Sixty-five randomly selected car 


salespersons were asked the number of cars they generally sell in one week. Fourteen people answered 
that they generally sell three cars; nineteen generally sell four cars; twelve generally sell five cars; nine 
generally sell six cars; eleven generally sell seven cars. Calculate the following: 

Exercise: 


Problem: sample mean = $\bar{x}$ =
Exercise: 


Problem: median = 


Solution: 
4 


Exercise: 


Problem: mode = 


Homework 


Exercise: 


Problem: 


The most obese countries in the world have obesity rates that range from 11.4% to 74.6%. This data 
is summarized in the following table. 


Percent of Population Obese    Number of Countries
11.4-20.45                     29
20.45-29.45                    13
29.45-38.45                    4
38.45-47.45                    0
47.45-56.45                    2
56.45-65.45                    1
65.45-74.45                    0
74.45-83.45                    1


a. What is the best estimate of the average obesity percentage for these countries? 
b. The United States has an average obesity rate of 33.9%. Is this rate above average or below? 
c. How does the United States compare to other countries? 


Exercise: 
Problem: 


The following table gives the percent of children under five considered to be underweight. What is the best
estimate for the mean percentage of underweight children? 


Percent of Underweight Children    Number of Countries
16-21.45                           23
21.45-26.9                         4
26.9-32.35                         9
32.35-37.8                         7
37.8-43.25                         6
43.25-48.7                         1

Solution: 


The mean percentage, $\bar{x} = \frac{1328.65}{50} = 26.57$


Bringing It Together 


Exercise: 


Problem: 


Javier and Ercilia are supervisors at a shopping mall. Each was given the task of estimating the mean 
distance that shoppers live from the mall. They each randomly surveyed 100 shoppers. The samples 
yielded the following information. 


             Javier       Ercilia
$\bar{x}$    6.0 miles    6.0 miles
s            4.0 miles    7.0 miles


a. How can you determine which survey was correct?

b. Explain what the difference in the results of the surveys implies about the data. 

c. If the two histograms depict the distribution of values for each supervisor, which one depicts 
Ercilia's sample? How do you know? 


[Figure: two histograms, labeled (a) and (b), each with 6 marked on the horizontal axis]


d. If the two box plots depict the distribution of values for each supervisor, which one depicts 
Ercilia’s sample? How do you know? 


[Figure: two box plots on separate axes]


Use the following information to answer the next three exercises: We are interested in the number of 
years students in a particular elementary statistics class have lived in California. The information in the 
following table is from the entire section. 


Number of years Frequency Number of years Frequency 
7 1 22 1 

14 3 23 1 

15 1 26 1 

18 1 40 2 

19 4 42 2 

20 3 


Total = 20 


Exercise: 


Problem: What is the IQR? 


Solution: 
a 
Exercise: 
Problem: What is the mode? 


a. 19 

b. 19.5 

c. 14 and 20 
d. 22.65 


Exercise: 


Problem: Is this a sample or the entire population? 


a. sample 
b. entire population 
c. neither 


Solution: 


b 


Glossary 


Frequency Table 
a data representation in which grouped data is displayed along with the corresponding frequencies 


Mean 
a number that measures the central tendency of the data; a common name for mean is ‘average.’ The
term ‘mean’ is a shortened form of ‘arithmetic mean.’ By definition, the mean for a sample (denoted
by $\bar{x}$) is $\bar{x} = \frac{\text{Sum of all values in the sample}}{\text{Number of values in the sample}}$, and the mean for a population (denoted by $\mu$) is
$\mu = \frac{\text{Sum of all values in the population}}{\text{Number of values in the population}}$.


Median 


a number that separates ordered data into halves; half the values are the same number or smaller 
than the median and half the values are the same number or larger than the median. The median may 


or may not be part of the data. 


Midpoint 
the mean of an interval in a frequency table 


Mode 
the value that appears most frequently in a set of data 


Skewness and the Mean, Median, and Mode 


Consider the following data set. 
4; 5; 6; 6; 6; 7; 7; 7; 7; 7; 7; 8; 8; 8; 9; 10


This data set can be represented by the following histogram. Each interval has
width one, and each value is located in the middle of an interval.


[Figure: histogram with horizontal axis from 4 to 10]


The histogram displays a symmetrical distribution of data. A distribution is 
symmetrical if a vertical line can be drawn at some point in the histogram 
such that the shape to the left and the right of the vertical line are mirror 
images of each other. The mean, the median, and the mode are each seven 
for these data. In a perfectly symmetrical distribution, the mean and the 
median are the same. This example has one mode (unimodal), and the 
mode is the same as the mean and median. In a symmetrical distribution 
that has two modes (bimodal), the two modes would be different from the 
mean and median. 


The histogram for the data: 4; 5; 6; 6; 6; 7; 7; 7; 7; 8 is not symmetrical. The right-hand
side seems "chopped off" compared to the left side. A distribution of this
type is called skewed to the left because it is pulled out to the left.


[Figure: histogram with horizontal axis from 4 to 8]


The mean is 6.3, the median is 6.5, and the mode is seven. Notice that the
mean is less than the median, and they are both less than the mode. The
mean and the median both reflect the skewing, but the mean reflects it more
so.


The histogram for the data: 6; 7; 7; 7; 7; 8; 8; 8; 9; 10 is also not symmetrical. It is
skewed to the right.


[Figure: histogram with horizontal axis from 6 to 10]


The mean is 7.7, the median is 7.5, and the mode is seven. Of the three
statistics, the mean is the largest, while the mode is the smallest. Again,
the mean reflects the skewing the most.


To summarize, generally if the distribution of data is skewed to the left, the 
mean is less than the median, which is often less than the mode. If the 


distribution of data is skewed to the right, the mode is often less than the 
median, which is less than the mean. 
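The three data sets discussed above can be compared side by side. This Python sketch uses the standard `statistics` module; the dictionary keys are labels chosen for this illustration.

```python
from statistics import mean, median, mode

datasets = {
    "symmetrical":  [4, 5, 6, 6, 6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 9, 10],
    "skewed left":  [4, 5, 6, 6, 6, 7, 7, 7, 7, 8],
    "skewed right": [6, 7, 7, 7, 7, 8, 8, 8, 9, 10],
}

for name, data in datasets.items():
    # Print the three measures of center for each data set.
    print(f"{name}: mean = {mean(data):.1f}, median = {median(data):.1f}, mode = {mode(data)}")
```

For these particular data sets the mean sits farthest toward the long tail, the mode sits at the peak, and the median falls between them. As the summary notes, this ordering is typical but not guaranteed for every skewed data set.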


Skewness and symmetry become important when we discuss probability 
distributions in later chapters. 


Example: 
Exercise: 


Problem: 


Statistics are used to compare and sometimes identify authors. The
following lists show a simple random sample that compares the letter
counts for three authors.


Terry: 7; 9; 3; 3; 3; 4; 1; 3; 2; 2
Davis: 3; 3; 3; 4; 1; 4; 3; 2; 3; 1
Maris: 2; 3; 4; 4; 4; 6; 6; 6; 8; 3


a. Make a dot plot for the three authors and compare the shapes. 

b. Calculate the mean for each. 

c. Calculate the median for each. 

d. Describe any pattern you notice between the shape and the 
measures of center. 


Solution: 


a. [Figure: dot plot titled "Terry’s Letter Count"]

Terry’s distribution has a right (positive) skew.

[Figure: dot plot titled "Davis’ Letter Count"]

Davis’ distribution has a left (negative) skew.

[Figure: dot plot titled "Maris’ Letter Count"]

Maris’ distribution is symmetrically shaped.


b. Terry’s mean is 3.7, Davis’ mean is 2.7, Maris’ mean is 4.6.

c. Terry’s median is three, Davis’ median is three, and Maris’ median is
four.

d. It appears that the median is always closest to the high point (the
mode), while the mean tends to be farther out on the tail. In a
symmetrical distribution, the mean and the median are both
centrally located close to the high point of the distribution.


Note: 
Try It 
Exercise: 


Problem: 


Discuss the mean, median, and mode for each of the following 
problems. Is there a pattern between the shape and measure of the 
center? 


a. [Figure: dot plot titled "2010 Winter Olympics Gold Medal Wins by Top 20
Medal-Winning Countries"; horizontal axis: number of gold medals won]

b. The Ages Former U.S. Presidents Died

4 | 6 9
5 | 3 6 7 7 7 8
6 | 0 0 3 3 4 4 5 6 7 7 7 8
7 | 0 1 1 2 3 4 7 8 8 9
8 | 0 1 3 5 8
9 | 0 0 3 3

Key: 8|0 means 80.

c. [Figure: histogram titled "Hours Spent Playing Video Games on Weekends";
horizontal axis: hours spent playing video games, with intervals 0-4.99, 5-9.99,
10-14.99, 15-19.99, 20-24.99]

Solution: 


a. mean = 4.25, median = 3.5, mode = 1; The mean > median >
mode which indicates skewness to the right. (data are 0, 1, 2, 3,
4, 5, 6, 9, 10, 14 and respective frequencies are 2, 4, 3, 1, 2, 2, 2,
2, 1, 1)

b. mean = 70.1, median = 68, mode = 57, 67 (bimodal); the mean
and median are close but there is a little skewness to the right
which is influenced by the data being bimodal. (data are 46, 49,
53, 56, 57, 57, 57, 58, 60, 60, 63, 63, 64, 64, 65, 66, 67, 67, 67,
68, 70, 71, 71, 72, 73, 74, 77, 78, 78, 79, 80, 81, 83, 85, 88, 90,
90, 93, 93)

c. These are estimates: mean = 16.095, median = 17.495, mode =
22.495 (there may be no mode); The mean < median < mode
which indicates skewness to the left. (data are the midpoints of
the intervals: 2.495, 7.495, 12.495, 17.495, 22.495 and respective
frequencies are 2, 3, 4, 7, 9).


Chapter Review 


Looking at the distribution of data can reveal a lot about the relationship
between the mean, the median, and the mode. There are three types of
distributions. A right (or positive) skewed distribution has a shape like
[link]. A left (or negative) skewed distribution has a shape like [link]. A
symmetrical distribution looks like [link].


Use the following information to answer the next three exercises: State 
whether the data are symmetrical, skewed to the left, or skewed to the right. 
Exercise: 


Problem: 1; 1; 1; 2; 2; 2; 2; 3; 3; 3; 3; 3; 3; 3; 3; 4; 4; 4; 5; 5


Solution: 


The data are symmetrical. The median is 3 and the mean is 2.85. They 
are close, and the mode lies close to the middle of the data, so the data 
are symmetrical. 


Exercise: 


Problem: 16; 17; 19; 22; 22; 22; 22; 22; 23


Exercise: 


Problem: 87; 87; 87; 87; 87; 88; 89; 89; 90; 91


Solution: 


The data are skewed right. The median is 87.5 and the mean is 88.2. 
Even though they are close, the mode lies to the left of the middle of 
the data, and there are many more instances of 87 than any other 
number, so the data are skewed right. 


Exercise: 
Problem: 
When the data are skewed left, what is the typical relationship between 
the mean and median? 


Exercise: 


Problem: 


When the data are symmetrical, what is the typical relationship 
between the mean and median? 


Solution: 


When the data are symmetrical, the mean and median are close or the 
same. 


Exercise: 


Problem: What word describes a distribution that has two modes? 


Exercise: 
Problem: Describe the shape of this distribution. 
[Figure: graph of the distribution]


Solution: 


The distribution is skewed right because it looks pulled out to the right. 
Exercise: 
Problem: 


Describe the relationship between the mode and the median of this 
distribution. 


Exercise: 


Problem: 


Describe the relationship between the mean and the median of this
distribution.

[Figure: graph of the distribution]


Solution: 


The mean is 4.1 and is slightly greater than the median, which is four. 


Exercise: 


Problem: Describe the shape of this distribution. 


Exercise: 


Problem: 


Describe the relationship between the mode and the median of this 
distribution. 


Solution: 


The mode and the median are the same. In this case, they are both five. 
Exercise: 
Problem: 


Are the mean and the median the exact same in this distribution? Why 
or why not? 


Exercise: 


Problem: Describe the shape of this distribution. 




Solution: 


The distribution is skewed left because it looks pulled out to the left. 
Exercise: 
Problem: 


Describe the relationship between the mode and the median of this 
distribution. 




Exercise: 
Problem: 


Describe the relationship between the mean and the median of this
distribution.


Solution: 
The mean and the median are both six. 
Exercise: 
Problem: The mean and median for the data are the same. 
3; 4; 5; 5; 6; 6; 6; 6; 7; 7; 7; 7; 7; 7; 7


Is the data perfectly symmetrical? Why or why not? 


Exercise: 


Problem: 


Which is the greatest, the mean, the mode, or the median of the data 
set? 


11; 11; 12; 12; 12; 12; 13; 15; 17; 22; 22; 22
Solution: 


The mode is 12, the median is 12.5, and the mean is 15.1. The mean is 
the largest. 
Exercise: 


Problem: 
Which is the least, the mean, the mode, or the median of the data set?


56; 56; 56; 58; 59; 60; 62; 64; 64; 65; 67
Exercise: 
Problem: 


Of the three measures, which tends to reflect skewing the most, the 
mean, the mode, or the median? Why? 


Solution: 


The mean tends to reflect skewing the most because it is affected the 
most by outliers. 

Exercise: 
Problem: 


In a perfectly symmetrical distribution, when would the mode be 
different from the mean and median? 


Homework 


Exercise: 


Problem: 


The median age of the U.S. population in 1980 was 30.0 years. In 
1991, the median age was 33.1 years. 


a. What does it mean for the median age to rise? 
b. Give two reasons why the median age could rise. 


c. For the median age to rise, is the actual number of children less in 
1991 than it was in 1980? Why or why not? 


Measures of the Spread of the Data 


An important characteristic of any set of data is the variation in the data. In some data 
sets, the data values are concentrated closely near the mean; in other data sets, the data 
values are more widely spread out from the mean. The most common measure of 
variation, or spread, is the standard deviation. The standard deviation is a number that 
measures how far data values are from their mean. 


The standard deviation 


• provides a numerical measure of the overall amount of variation in a data set, and
• can be used to determine whether a particular data value is close to or far from the
mean.


The standard deviation provides a measure of the overall variation in a data set 


The standard deviation is always positive or zero. The standard deviation is small when 
the data are all concentrated close to the mean, exhibiting little variation or spread. The 
standard deviation is larger when the data values are more spread out from the mean, 
exhibiting more variation. 


Suppose that we are studying the amount of time customers wait in line at the checkout 
at supermarket A and supermarket B. The average wait time at both supermarkets is five
minutes. At supermarket A, the standard deviation for the wait time is two minutes; at 
supermarket B the standard deviation for the wait time is four minutes. 


Because supermarket B has a higher standard deviation, we know that there is more 
variation in the wait times at supermarket B. Overall, wait times at supermarket B are 
more spread out from the average; wait times at supermarket A are more concentrated 
near the average. 


The standard deviation can be used to determine whether a data value is close to 
or far from the mean. 


Suppose that Rosa and Binh both shop at supermarket A. Rosa waits at the checkout 
counter for seven minutes and Binh waits for one minute. At supermarket A, the mean 
waiting time is five minutes and the standard deviation is two minutes. The standard 
deviation can be used to determine whether a data value is close to or far from the 
mean. 


Rosa waits for seven minutes: 


• Seven is two minutes longer than the average of five; two minutes is equal to one
standard deviation.

• Rosa's wait time of seven minutes is two minutes longer than the average of five
minutes.

• Rosa's wait time of seven minutes is one standard deviation above the average
of five minutes.


Binh waits for one minute. 


• One is four minutes less than the average of five; four minutes is equal to two
standard deviations.

• Binh's wait time of one minute is four minutes less than the average of five
minutes.

• Binh's wait time of one minute is two standard deviations below the average of
five minutes.

• A data value that is two standard deviations from the average is just on the
borderline for what many statisticians would consider to be far from the average.
Considering data to be far from the mean if it is more than two standard deviations
away is more of an approximate "rule of thumb" than a rigid rule. In general, the
shape of the distribution of the data affects how much of the data is further away
than two standard deviations. (You will learn more about this in later chapters.)


The number line may help you understand standard deviation. If we were to put five 
and seven on a number line, seven is to the right of five. We say, then, that seven is one 
standard deviation to the right of five because 5 + (1)(2) = 7. 


If one were also part of the data set, then one is two standard deviations to the left of 
five because 5 + (−2)(2) = 1.


0   1   2   3   4   5   6   7


• In general, a value = mean + (#ofSTDEVs)(standard deviation)
• where #ofSTDEVs = the number of standard deviations
• #ofSTDEVs does not need to be an integer
• One is two standard deviations less than the mean of five because: 1 = 5 + (−2)(2).
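
The relationship value = mean + (#ofSTDEVs)(standard deviation) can be sketched as a one-line function. The function name `value_at` is an illustrative choice, not notation from the text; the numbers are the supermarket A figures used above.

```python
def value_at(mean, std_dev, num_stdevs):
    """Return the value lying num_stdevs standard deviations from the mean."""
    return mean + num_stdevs * std_dev

# Supermarket A: mean wait of five minutes, standard deviation of two minutes.
rosa = value_at(5, 2, 1)      # one standard deviation above the mean
binh = value_at(5, 2, -2)     # two standard deviations below the mean
between = value_at(5, 2, 1.5) # #ofSTDEVs need not be an integer

print(rosa, binh, between)
```

A negative number of standard deviations places the value to the left of the mean on the number line, and a positive number places it to the right.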


The equation value = mean + (#ofSTDEVs)(standard deviation) can be expressed for
a sample and for a population.


• Sample: x = x̄ + (#ofSTDEVs)(s)
• Population: x = μ + (#ofSTDEVs)(σ)


The lower case letter s represents the sample standard deviation and the Greek letter σ
(sigma, lower case) represents the population standard deviation.


The symbol x̄ is the sample mean and the Greek symbol μ is the population mean.


Calculating the Standard Deviation 


If x is a number, then the difference "x − mean" is called its deviation. In a data set,
there are as many deviations as there are items in the data set. The deviations are used
to calculate the standard deviation. If the numbers belong to a population, in symbols a
deviation is x − μ. For sample data, in symbols a deviation is x − x̄.


The procedure to calculate the standard deviation depends on whether the numbers are 
the entire population or are data from a sample. The calculations are similar, but not 
identical. Therefore the symbol used to represent the standard deviation depends on 
whether it is calculated from a population or a sample. The lower case letter s
represents the sample standard deviation and the Greek letter σ (sigma, lower case)
represents the population standard deviation. If the sample has the same characteristics
as the population, then s should be a good estimate of σ.


To calculate the standard deviation, we need to calculate the variance first. The
variance is the average of the squares of the deviations (the x − x̄ values for a
sample, or the x − μ values for a population). The symbol σ² represents the population
variance; the population standard deviation σ is the square root of the population
variance. The symbol s² represents the sample variance; the sample standard deviation s
is the square root of the sample variance. You can think of the standard deviation as a
special average of the deviations.


If the numbers come from a census of the entire population and not a sample, when we 
calculate the average of the squared deviations to find the variance, we divide by N, the 
number of items in the population. If the data are from a sample rather than a 
population, when we calculate the average of the squared deviations, we divide by
n − 1, one less than the number of items in the sample.


Formulas for the Sample Standard Deviation


• $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$ or $s = \sqrt{\frac{\sum f(x - \bar{x})^2}{n - 1}}$
• For the sample standard deviation, the denominator is n − 1, that is, the sample size
minus 1.


Formulas for the Population Standard Deviation


• $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$ or $\sigma = \sqrt{\frac{\sum f(x - \mu)^2}{N}}$
• For the population standard deviation, the denominator is N, the number of items
in the population.


In these formulas, f represents the frequency with which a value appears. For example,
if a value appears once, f is one. If a value appears three times in the data set or
population, f is three.
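
To see how the frequency f enters the formulas, here is a minimal sketch that computes both standard deviations from a table of values and frequencies. The function name `std_dev_from_frequencies` and the small data set are hypothetical and chosen only for illustration.

```python
from math import sqrt

def std_dev_from_frequencies(freq_table, sample=True):
    """Standard deviation from a {value: frequency} table.

    Divides by n - 1 for a sample and by N for a population.
    """
    n = sum(freq_table.values())
    mean = sum(x * f for x, f in freq_table.items()) / n
    # Each squared deviation is counted f times.
    sum_sq = sum(f * (x - mean) ** 2 for x, f in freq_table.items())
    divisor = n - 1 if sample else n
    return sqrt(sum_sq / divisor)

# Hypothetical data: the value 4 appears three times, so f = 3 for it.
table = {2: 1, 4: 3, 6: 1}

s = std_dev_from_frequencies(table, sample=True)
sigma = std_dev_from_frequencies(table, sample=False)
print(f"s = {s:.4f}, sigma = {sigma:.4f}")
```

For the same data, the sample value s is always at least as large as the population value σ, because the sum of squared deviations is divided by the smaller number n − 1.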


Sampling Variability of a Statistic 


The statistic of a sampling distribution was discussed in Descriptive Statistics: 
Measuring the Center of the Data. How much the statistic varies from one sample to 
another is known as the sampling variability of a statistic. You typically measure the 
sampling variability of a statistic by its standard error. The standard error of the mean 
is an example of a standard error. It is a special standard deviation and is known as the 
standard deviation of the sampling distribution of the mean. You will cover the standard 
error of the mean in the chapter The Central Limit Theorem (not now). The notation for
the standard error of the mean is $\frac{\sigma}{\sqrt{n}}$, where σ is the standard deviation of the population
and n is the size of the sample.
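
As a quick illustration of the notation, the sketch below evaluates σ divided by the square root of n. The values σ = 12 and n = 36 are hypothetical and are not taken from any data set in this chapter.

```python
from math import sqrt

def standard_error_of_mean(sigma, n):
    """Standard error of the mean: population standard deviation over sqrt(n)."""
    return sigma / sqrt(n)

# Hypothetical population standard deviation and sample size.
se = standard_error_of_mean(12, 36)
print(se)
```

Because n appears under a square root, quadrupling the sample size only halves the standard error.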


Note:

In practice, USE A CALCULATOR OR COMPUTER SOFTWARE TO
CALCULATE THE STANDARD DEVIATION. If you are using a TI-83, 83+, 84+
calculator, you need to select the appropriate standard deviation, σₓ or sₓ, from the
summary statistics. We will concentrate on using and interpreting the information that
the standard deviation gives us. However, you should study the following step-by-step
example to help you understand how the standard deviation measures variation from
the mean. (The calculator instructions appear at the end of this example.)


Example: 

In a fifth grade class, the teacher was interested in the average age and the sample 
standard deviation of the ages of her students. The following data are the ages for a 
SAMPLE of n = 20 fifth grade students. The ages are rounded to the nearest half year: 
9; 9.5; 9.5; 10; 10; 10; 10; 10.5; 10.5; 10.5; 10.5; 11; 11; 11; 11; 11; 11; 11.5; 11.5;
11.5

Equation:


$\bar{x} = \frac{9 + 9.5(2) + 10(4) + 10.5(4) + 11(6) + 11.5(3)}{20} = 10.525$


The average age is 10.53 years, rounded to two places. 

The variance may be calculated by using a table. Then the standard deviation is 
calculated by taking the square root of the variance. We will explain the parts of the 
table after calculating s. 


Data   Freq.   Deviations                 Deviations²              (Freq.)(Deviations²)
x      f       (x − x̄)                    (x − x̄)²                 f(x − x̄)²
9      1       9 − 10.525 = −1.525        (−1.525)² = 2.325625     1 × 2.325625 = 2.325625
9.5    2       9.5 − 10.525 = −1.025      (−1.025)² = 1.050625     2 × 1.050625 = 2.101250
10     4       10 − 10.525 = −0.525       (−0.525)² = 0.275625     4 × 0.275625 = 1.1025
10.5   4       10.5 − 10.525 = −0.025     (−0.025)² = 0.000625     4 × 0.000625 = 0.0025
11     6       11 − 10.525 = 0.475        (0.475)² = 0.225625      6 × 0.225625 = 1.35375
11.5   3       11.5 − 10.525 = 0.975      (0.975)² = 0.950625      3 × 0.950625 = 2.851875

The total is 9.7375


The sample variance, s², is equal to the sum of the last column (9.7375) divided by the
total number of data values minus one (20 − 1):


$s^2 = \frac{9.7375}{20 - 1} = 0.5125$


The sample standard deviation s is equal to the square root of the sample variance:
$s = \sqrt{0.5125} = 0.715891$, which is rounded to two decimal places, s = 0.72.
Typically, you do the calculation for the standard deviation on your calculator or 
computer. The intermediate results are not rounded. This is done for accuracy. 
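
The table calculation for the fifth grade ages can be reproduced step by step on a computer. The following is a sketch in plain Python rather than the calculator procedure the text uses; the variable names are our own.

```python
from math import sqrt

# Ages and their frequencies from the fifth grade example (n = 20).
ages = [9, 9.5, 10, 10.5, 11, 11.5]
freqs = [1, 2, 4, 4, 6, 3]

n = sum(freqs)
mean = sum(x * f for x, f in zip(ages, freqs)) / n

# Build the table row by row: deviation, squared deviation, f * squared deviation.
total = 0.0
for x, f in zip(ages, freqs):
    dev = x - mean
    weighted = f * dev ** 2
    total += weighted
    print(f"{x:>5} {f:>3} {dev:>8.3f} {dev ** 2:>10.6f} {weighted:>10.6f}")

variance = total / (n - 1)  # sample variance: divide by n - 1
s = sqrt(variance)          # sample standard deviation
print(f"mean = {mean:.3f}, total = {total:.4f}, s^2 = {variance:.4f}, s = {s:.6f}")
```

Rounding happens only when the results are printed, which mirrors the advice above to keep intermediate results unrounded.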
Exercise: 


Problem: 


For the following problems, recall that value = mean + (#ofSTDEVs)
(standard deviation). Verify the mean and standard deviation on a calculator
or computer.

For a sample: x = x̄ + (#ofSTDEVs)(s)

For a population: x = μ + (#ofSTDEVs)(σ)

• For this example, use x = x̄ + (#ofSTDEVs)(s) because the data is from a
sample


a. Verify the mean and standard deviation on your calculator or computer.

b. Find the value that is one standard deviation above the mean. Find (x̄ + 1s).

c. Find the value that is two standard deviations below the mean. Find (x̄ − 2s).

d. Find the values that are 1.5 standard deviations from (below and above) the
mean.


Solution: 


a. Note:


• Clear lists L1 and L2. Press STAT 4:ClrList. Enter 2nd 1 for L1, the
comma (,), and 2nd 2 for L2.
• Enter data into the list editor. Press STAT 1:EDIT. If necessary, clear
the lists by arrowing up into the name. Press CLEAR and arrow down.
• Put the data values (9, 9.5, 10, 10.5, 11, 11.5) into list L1 and the
frequencies (1, 2, 4, 4, 6, 3) into list L2. Use the arrow keys to move
around.
• Press STAT and arrow to CALC. Press 1:1-VarStats and enter L1 (2nd
1), L2 (2nd 2). Do not forget the comma. Press ENTER.
• x̄ = 10.525
• Use Sx because this is sample data (not a population): Sx = 0.715891


b. (x̄ + 1s) = 10.53 + (1)(0.72) = 11.25
c. (x̄ − 2s) = 10.53 − (2)(0.72) = 9.09


d. • (x̄ − 1.5s) = 10.53 − (1.5)(0.72) = 9.45
   • (x̄ + 1.5s) = 10.53 + (1.5)(0.72) = 11.61


Note: 
Try It 
Exercise: 


Problem: On a baseball team, the ages of each of the players are as follows:
21; 21; 22; 23; 24; 24; 25; 25; 28; 29; 29; 31; 32; 33; 33; 34; 35; 36; 36; 36; 36;
38; 38; 38; 40

Use your calculator or computer to find the mean and standard deviation. Then 
find the value that is two standard deviations above the mean. 

Solution: 

x̄ = 30.68


s = 6.09
(x̄ + 2s) = 30.68 + (2)(6.09) = 42.86.


Explanation of the standard deviation calculation shown in the table 


The deviations show how spread out the data are about the mean. The data value 11.5 is 
farther from the mean than is the data value 11 which is indicated by the deviations 0.97 
and 0.47. A positive deviation occurs when the data value is greater than the mean, 
whereas a negative deviation occurs when the data value is less than the mean. The 
deviation is −1.525 for the data value nine. If you add the deviations, the sum is
always zero. (For [link], there are n = 20 deviations.) So you cannot simply add the 
deviations to get the spread of the data. By squaring the deviations, you make them 
positive numbers, and the sum will also be positive. The variance, then, is the average 
squared deviation. 


The variance is a squared measure and does not have the same units as the data. Taking 
the square root solves the problem. The standard deviation measures the spread in the 
same units as the data. 


Notice that instead of dividing by n = 20, the calculation divided by n − 1 = 20 − 1 = 19
because the data is a sample. For the sample variance, we divide by the sample size
minus one (n − 1). Why not divide by n? The answer has to do with the population
variance. The sample variance is an estimate of the population variance. Based on
the theoretical mathematics that lies behind these calculations, dividing by (n − 1) gives
a better estimate of the population variance.
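
One way to see why dividing by n − 1 gives a better estimate is to list every possible sample from a tiny population and average the resulting variances. The sketch below assumes a hypothetical population of three values and samples of size two drawn with replacement; it is an illustration, not the theoretical argument the text alludes to.

```python
from fractions import Fraction
from itertools import product

def variance(values, divisor):
    """Sum of squared deviations from the mean, divided by the given divisor."""
    m = Fraction(sum(values), len(values))
    return sum((x - m) ** 2 for x in values) / divisor

# Hypothetical population, small enough to enumerate every sample.
population = [2, 4, 6]
pop_var = variance(population, len(population))

# All ordered samples of size 2, drawn with replacement.
samples = list(product(population, repeat=2))

avg_n_minus_1 = sum(variance(s, len(s) - 1) for s in samples) / len(samples)
avg_n = sum(variance(s, len(s)) for s in samples) / len(samples)

print(f"population variance       = {pop_var}")
print(f"average of s^2 with n - 1 = {avg_n_minus_1}")
print(f"average of s^2 with n     = {avg_n}")
```

Averaged over all samples, the n − 1 version matches the population variance exactly, while the version that divides by n comes out too small.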


Note: 


Your concentration should be on what the standard deviation tells us about the data. 
The standard deviation is a number which measures how far the data are spread from 
the mean. Let a calculator or computer do the arithmetic. 


The standard deviation, s or σ, is either zero or larger than zero. Describing the data
with reference to the spread is called "variability". The variability in data depends upon
the method by which the outcomes are obtained; for example, by measuring or by
random sampling. When the standard deviation is zero, there is no spread; that is, all
the data values are equal to each other. The standard deviation is small when the data
are all concentrated close to the mean, and is larger when the data values show more
variation from the mean. When the standard deviation is a lot larger than zero, the data
values are very spread out about the mean; outliers can make s or σ very large.


The standard deviation, when first presented, can seem unclear. By graphing your data, 
you can get a better "feel" for the deviations and the standard deviation. You will find 
that in symmetrical distributions, the standard deviation can be very helpful but in 
skewed distributions, the standard deviation may not be much help. The reason is that 


the two sides of a skewed distribution have different spreads. In a skewed distribution, 
it is better to look at the first quartile, the median, the third quartile, the smallest value, 
and the largest value. Because numbers can be confusing, always graph your data. 
Display your data in a histogram or a box plot. 


Example: 
Exercise: 


Problem: 


Use the following data (first exam scores) from Susan Dean's spring pre-calculus 
class: 


33; 42; 49; 49; 53; 55; 55; 61; 63; 67; 68; 68; 69; 69; 72; 73; 74; 78; 80; 83; 88;
88; 88; 90; 92; 94; 94; 94; 94; 96; 100


a. Create a chart containing the data, frequencies, relative frequencies, and 
cumulative relative frequencies to three decimal places. 

b. Calculate the following to one decimal place using a TI-83+ or TI-84 
calculator: 


i. The sample mean 
ii. The sample standard deviation 
iii. The median 
iv. The first quartile 
v. The third quartile 
vi. IQR 


c. Construct a box plot and a histogram on the same set of axes. Make 
comments about the box plot, the histogram, and the chart. 


Solution: 
a. See [link] 


b. i. The sample mean = 73.5 
ii. The sample standard deviation = 17.9 
iii. The median = 73 
iv. The first quartile = 61 
v. The third quartile = 90 
vi. IQR = 90 — 61 = 29 


c. The x-axis goes from 32.5 to 100.5; y-axis goes from −2.4 to 15 for the
histogram. The number of intervals is five, so the width of an interval is 
(100.5 — 32.5) divided by five, is equal to 13.6. Endpoints of the intervals are 
as follows: the starting point is 32.5, 32.5 + 13.6 = 46.1, 46.1 + 13.6 = 59.7, 
59.7 + 13.6 = 73.3, 73.3 + 13.6 = 86.9, 86.9 + 13.6 = 100.5 = the ending 
value; No data values fall on an interval boundary. 


[Figure: box plot and histogram of the exam scores on the same axes; x-axis marked at
32.5, 46.1, 59.7, 73.3, 86.9, and 100.5]


The long left whisker in the box plot is reflected in the left side of the histogram. The 
spread of the exam scores in the lower 50% is greater (73 — 33 = 40) than the spread in 
the upper 50% (100 — 73 = 27). The histogram, box plot, and chart all reflect this. 
There are a substantial number of A and B grades (80s, 90s, and 100). The histogram 
clearly shows this. The box plot shows us that the middle 50% of the exam scores (IQR 
= 29) are Ds, Cs, and Bs. The box plot also shows us that the lower 25% of the exam 
scores are Ds and Fs. 
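
The summary statistics in part b can also be obtained with software instead of a TI calculator. This sketch uses Python's standard `statistics` module. Quartile conventions differ between tools, so the agreement with the calculator values here is a property of this data set and of the module's default `exclusive` method, not a general guarantee.

```python
from statistics import mean, quantiles, stdev

# First exam scores from the example (31 values).
scores = [33, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 69, 72, 73,
          74, 78, 80, 83, 88, 88, 88, 90, 92, 94, 94, 94, 94, 96, 100]

sample_mean = mean(scores)
sample_sd = stdev(scores)            # sample standard deviation (divides by n - 1)
q1, med, q3 = quantiles(scores, n=4) # first quartile, median, third quartile
iqr = q3 - q1

print(f"mean = {sample_mean:.1f}, s = {sample_sd:.1f}")
print(f"Q1 = {q1}, median = {med}, Q3 = {q3}, IQR = {iqr}")
```

The lower half of the scores spans a wider range than the upper half, which is the same asymmetry the box plot's long left whisker shows.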


Data   Frequency   Relative Frequency   Cumulative Relative Frequency
33     1           0.032                0.032
42     1           0.032                0.064
49     2           0.065                0.129
53     1           0.032                0.161
55     2           0.065                0.226
61     1           0.032                0.258
63     1           0.032                0.290
67     1           0.032                0.322
68     2           0.065                0.387
69     2           0.065                0.452
72     1           0.032                0.484
73     1           0.032                0.516
74     1           0.032                0.548
78     1           0.032                0.580
80     1           0.032                0.612
83     1           0.032                0.644
88     3           0.097                0.741
90     1           0.032                0.773
92     1           0.032                0.805
94     4           0.129                0.934
96     1           0.032                0.966
100    1           0.032                0.998 (Why isn't this value 1?)
Note: 
Try It 


Exercise: 


Problem: 


The following data show the number of different types of pet food that stores in the
area carry.
6; 6; 6; 6; 7; 7; 7; 7; 7; 8; 9; 9; 9; 9; 10; 10; 10; 10; 10; 11; 11; 11; 11; 12; 12;
12; 12; 12; 12

Calculate the sample mean and the sample standard deviation to one decimal 
place using a TI-83+ or TI-84 calculator. 


Solution: 
x̄ = 9.3
s = 2.2


Standard deviation of Grouped Frequency Tables 


Recall that for grouped data we do not know individual data values, so we cannot 
describe the typical value of the data with precision. In other words, we cannot find the 
exact mean, median, or mode. We can, however, determine the best estimate of the 
measures of center by finding the mean of the grouped data with the formula: 

$\text{Mean of Frequency Table} = \frac{\sum fm}{\sum f}$


where f = interval frequencies and m = interval midpoints. 


Just as we could not find the exact mean, neither can we find the exact standard 
deviation. Remember that standard deviation describes numerically the expected 
deviation a data value has from the mean. In simple English, the standard deviation 
allows us to compare how “unusual” individual data is compared to the mean. 


Example: 
Find the standard deviation for the data in [link]. 


Class    Frequency, f   Midpoint, m   m²     x̄      fm²    Standard Deviation
0–2      1              1             1      7.58   1      3.5
3–5      6              4             16     7.58   96     3.5
6–8      10             7             49     7.58   490    3.5
9–11     7              10            100    7.58   700    3.5
12–14    0              13            169    7.58   0      3.5
15–17    2              16            256    7.58   512    3.5

For this data set, we have the mean, x̄ = 7.58 and the standard deviation, sₓ = 3.5. This
means that a randomly selected data value would be expected to be 3.5 units from the
mean. If we look at the first class, we see that the class midpoint is equal to one. This
is almost two full standard deviations from the mean since 7.58 − 3.5 − 3.5 = 0.58.
While the formula for calculating the standard deviation is not complicated,


$s_x = \sqrt{\frac{\sum f m^2}{n} - \bar{x}^2}$, where sₓ = sample standard deviation and x̄ = sample mean, the


calculations are tedious. It is usually best to use technology when performing the
calculations.
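
The grouped-data calculation can be sketched in a few lines, using the class midpoints and frequencies from the example. The variable names are our own. Note that the shortcut formula as written divides by n, whereas a calculator's sample value sₓ divides by n − 1; the sketch computes both so the difference is visible.

```python
from math import sqrt

# Class midpoints and frequencies from the grouped frequency table.
midpoints = [1, 4, 7, 10, 13, 16]
freqs = [1, 6, 10, 7, 0, 2]

n = sum(freqs)
sum_fm = sum(f * m for f, m in zip(freqs, midpoints))
sum_fm2 = sum(f * m ** 2 for f, m in zip(freqs, midpoints))

mean = sum_fm / n

# Shortcut formula from the text: sqrt(sum(f * m^2) / n - mean^2).
s_shortcut = sqrt(sum_fm2 / n - mean ** 2)

# Same sum of squares divided by n - 1, as a calculator's sample value does.
s_sample = sqrt((sum_fm2 - n * mean ** 2) / (n - 1))

print(f"n = {n}, mean = {mean:.2f}")
print(f"shortcut formula: {s_shortcut:.2f}, dividing by n - 1: {s_sample:.2f}")
```

With 26 data values the two results differ only slightly, and the n − 1 version is the one that rounds to the 3.5 reported above.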


Note: 
Try It 
Find the standard deviation for the data from the previous example.


Class    Frequency, f
0–2      1
3–5      6
6–8      10
9–11     7
12–14    0
15–17    2


First, press the STAT key and select 1:Edit.


Input the midpoint values into L1 and the frequencies into L2.


Select STAT, CALC, and 1: 1-Var Stats.


Select 2nd then 1 then , 2nd then 2 Enter.


You will see displayed both a population standard deviation, σₓ, and the sample
standard deviation, sₓ.


Comparing Values from Different Data Sets 


The standard deviation is useful when comparing data values that come from different 
data sets. If the data sets have different means and standard deviations, then comparing 
the data values directly can be misleading. 


• For each data value, calculate how many standard deviations away from its mean
the value is.
• Use the formula: value = mean + (#ofSTDEVs)(standard deviation); solve for
#ofSTDEVs.
• $\text{\#ofSTDEVs} = \frac{\text{value} - \text{mean}}{\text{standard deviation}}$
• Compare the results of this calculation.


#ofSTDEVs is often called a "z-score"; we can use the symbol z. In symbols, the 
formulas become: 


Sample:       $x = \bar{x} + zs$          $z = \frac{x - \bar{x}}{s}$
Population:   $x = \mu + z\sigma$         $z = \frac{x - \mu}{\sigma}$
Example: 


Exercise: 


Problem: 


Two students, John and Ali, from different high schools, wanted to find out who 
had the highest GPA when compared to his school. Which student had the highest 
GPA when compared to his school? 


School Mean School Standard 
Student GPA GPA Deviation 
John 2.85 3.0 0.7 
Ali 77 80 10 


Solution: 


For each student, determine how many standard deviations (#ofSTDEVs) his GPA 
is away from the average, for his school. Pay careful attention to signs when 
comparing and interpreting the answer. 


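$z = \text{\#ofSTDEVs} = \frac{\text{value} - \text{mean}}{\text{standard deviation}} = \frac{x - \mu}{\sigma}$


For John, $z = \text{\#ofSTDEVs} = \frac{2.85 - 3.0}{0.7} = -0.21$


For Ali, $z = \text{\#ofSTDEVs} = \frac{77 - 80}{10} = -0.3$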

John has the better GPA when compared to his school because his GPA is 0.21 
standard deviations below his school's mean while Ali's GPA is 0.3 standard 
deviations below his school's mean. 


John's z-score of −0.21 is higher than Ali's z-score of −0.3. For GPA, higher
values are better, so we conclude that John has the better GPA when compared to 
his school. 
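
The comparison of John and Ali can be written as a short calculation. This is a sketch in which the function name `z_score` is our own choice; the numbers come from the table in the problem.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / std_dev

john = z_score(2.85, 3.0, 0.7)
ali = z_score(77, 80, 10)

print(f"John: z = {john:.2f}")
print(f"Ali:  z = {ali:.2f}")

# For GPA, a higher z-score means a better standing relative to the school.
better = "John" if john > ali else "Ali"
print(f"{better} has the better GPA relative to his school.")
```

Because the two schools use different grading scales, only the z-scores are comparable; the raw GPAs of 2.85 and 77 are not.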


Note: 


Try It 
Exercise: 


Problem: 
Two swimmers, Angie and Beth, from different teams, wanted to find out who 


had the fastest time for the 50 meter freestyle when compared to her team. Which 
swimmer had the fastest time when compared to her team? 


Swimmer   Time (seconds)   Team Mean Time   Team Standard Deviation
Angie     26.2             27.2             0.8
Beth      27.3             30.1             1.4


Solution: 
For Angie: $z = \frac{26.2 - 27.2}{0.8} = -1.25$


For Beth: $z = \frac{27.3 - 30.1}{1.4} = -2$


The following lists give a few facts that provide a little more insight into what the 
standard deviation tells us about the distribution of the data. 
For ANY data set, no matter what the distribution of the data is: 


• At least 75% of the data is within two standard deviations of the mean.
• At least 89% of the data is within three standard deviations of the mean.
• At least 95% of the data is within 4.5 standard deviations of the mean.
• This is known as Chebyshev's Rule.
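
The three percentages above are instances of a single bound. Chebyshev's Rule is commonly stated as: at least 1 − 1/k² of the data lies within k standard deviations of the mean, for k greater than 1. The sketch below evaluates that expression; the function name is our own.

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2

for k in (2, 3, 4.5):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.1%}")
```

The bound is a guaranteed minimum for any distribution, so real data sets usually have a larger share of values inside each interval.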


For data having a distribution that is BELL-SHAPED and SYMMETRIC: 


• Approximately 68% of the data is within one standard deviation of the mean.
• Approximately 95% of the data is within two standard deviations of the mean.
• More than 99% of the data is within three standard deviations of the mean.
• This is known as the Empirical Rule.
• It is important to note that this rule only applies when the shape of the distribution
of the data is bell-shaped and symmetric. We will learn more about this when
studying the "Normal" or "Gaussian" probability distribution in later chapters.
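
The Empirical Rule percentages can be recovered from the normal distribution itself. This sketch assumes an exactly normal distribution and uses the standard identity that the fraction within k standard deviations equals erf(k/√2); real bell-shaped data will match only approximately.

```python
from math import erf, sqrt

def normal_fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {normal_fraction_within(k):.2%}")
```

Compared with Chebyshev's Rule, these figures are much higher for the same k, which is the payoff for knowing the distribution is bell-shaped and symmetric.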


Chapter Review


The standard deviation can help you calculate the spread of data. There are different
equations to use if you are calculating the standard deviation of a sample or of a population.


• The standard deviation allows us to compare individual data or classes to the data
set mean numerically.
• $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$ or $s = \sqrt{\frac{\sum f(x - \bar{x})^2}{n - 1}}$ is the formula for calculating the standard
deviation of a sample. To calculate the standard deviation of a population, we
would use the population mean, μ, and the formula $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$ or
$\sigma = \sqrt{\frac{\sum f(x - \mu)^2}{N}}$.
Formula Review 


$s_x = \sqrt{\frac{\sum f m^2}{n} - \bar{x}^2}$ where
sₓ = sample standard deviation
x̄ = sample mean

Use the following information to answer the next two exercises: The following data are 
the distances between 20 retail stores and a large distribution center. The distances are 
in miles. 

29; 37; 38; 40; 58; 67; 68; 69; 76; 86; 87; 95; 96; 96; 99; 106; 112; 127; 145; 150 
Exercise: 


Problem: 


Use a graphing calculator or computer to find the standard deviation and round to 
the nearest tenth. 


Solution: 


s = 34.5


Exercise: 


Problem: Find the value that is one standard deviation below the mean. 
Exercise: 

Problem: 

Two baseball players, Fredo and Karl, on different teams wanted to find out who 


had the higher batting average when compared to his team. Which baseball player 
had the higher batting average when compared to his team? 


Baseball Batting Team Batting Team Standard 
Player Average Average Deviation 
Fredo 0.158 0.166 0.012 
Karl 0.177 0.189 0.015 

Solution: 


For Fredo: $z = \frac{0.158 - 0.166}{0.012} = -0.67$
For Karl: $z = \frac{0.177 - 0.189}{0.015} = -0.8$
Fredo’s z-score of −0.67 is higher than Karl’s z-score of −0.8. For batting average,
higher values are better, so Fredo has a better batting average compared to his
team.


Exercise: 


Problem: Use [link] to find the value that is three standard deviations: 


a. above the mean
b. below the mean


Exercise: 


Problem: 


Find the standard deviation for the following frequency tables using the formula. 
Check the calculations with the TI 83/84. 


a. Grade                    Frequency
   49.5–59.5                2
   59.5–69.5                3
   69.5–79.5                8
   79.5–89.5                12
   89.5–99.5                5

b. Daily Low Temperature    Frequency
   49.5–59.5                53
   59.5–69.5                32
   69.5–79.5                15
   79.5–89.5                1
   89.5–99.5                0

c. Points per Game          Frequency
   49.5–59.5                14
   59.5–69.5                32
   69.5–79.5                15
   79.5–89.5                23
   89.5–99.5                2
Solution: 


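a. $s_x = \sqrt{\frac{\sum f m^2}{n} - \bar{x}^2} = \sqrt{\frac{193157.5}{30} - 79.5^2} = 10.88$


b. $s_x = \sqrt{\frac{\sum f m^2}{n} - \bar{x}^2} = \sqrt{\frac{380945.25}{101} - 60.94^2} = 7.62$


c. $s_x = \sqrt{\frac{\sum f m^2}{n} - \bar{x}^2} = \sqrt{\frac{440051.5}{86} - 70.66^2} = 11.14$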


Homework 


Use the following information to answer the next nine exercises: The population 
parameters below describe the full-time equivalent number of students (FTES) each 
year at Lake Tahoe Community College from 1976-1977 through 2004—2005. 


• μ = 1000 FTES
• median = 1,014 FTES
• σ = 474 FTES
• first quartile = 528.5 FTES
• third quartile = 1,447.5 FTES
• n = 29 years


Exercise: 


Problem: 


A sample of 11 years is taken. About how many are expected to have an FTES of
1014 or above? Explain how you determined your answer. 


Solution: 


The median value is the middle value in the ordered list of data values. The 
median value of a set of 11 will be the 6th number in order. Six years will have 
totals at or below the median. 


Exercise: 
Problem: 75% of all years have an FTES: 


a. at or below: 
b. at or above: 


Exercise: 


Problem: The population standard deviation = 


Solution: 


474 FTES 
Exercise: 


Problem: 


What percent of the FTES were from 528.5 to 1447.5? How do you know? 


Exercise: 


Problem: What is the IQR? What does the IQR represent?


Solution: 


919 


Exercise: 


Problem: How many standard deviations away from the mean is the median? 


Additional Information: The population FTES for 2005-2006 through 2010-2011 
was given in an updated report. The data are reported here. 


Year         2005–06   2006–07   2007–08   2008–09   2009–10   2010–11
Total FTES   1,585     1,690     1,735     1,935     2,021     1,890

Exercise: 

Problem: 


Calculate the mean, median, standard deviation, the first quartile, the third quartile 
and the IQR. Round to one decimal place. 


Solution: 


• mean = 1,809.3
• median = 1,812.5
• standard deviation = 151.2
• first quartile = 1,690
• third quartile = 1,935
• IQR = 245


Exercise: 


Problem: 


What additional information is needed to construct a box plot for the FTES for 
2005-2006 through 2010-2011 and a box plot for the FTES for 1976-1977 through 
2004-2005? 


Exercise: 
Problem: 
Compare the IQR for the FTES for 1976–77 through 2004–2005 with the IQR for
the FTES for 2005–2006 through 2010–2011. Why do you suppose the IQRs are so
different?


Solution: 
Hint: Think about the number of years covered by each time period and what 
happened to higher education during those periods. 

Exercise: 
Problem: 
Three students were applying to the same graduate school. They came from 
schools with different grading systems. Which student had the best GPA when 


compared to other students at his school? Explain how you determined your 
answer. 


Student   GPA   School Average GPA   School Standard Deviation
Thuy      2.7   3.2                  0.8
Vichet    87    75                   20
Kamala    8.6   8                    0.4


Exercise: 


Problem: 


A music school has budgeted to purchase three musical instruments. They plan to 
purchase a piano costing $3,000, a guitar costing $550, and a drum set costing 
$600. The mean cost for a piano is $4,000 with a standard deviation of $2,500. 
The mean cost for a guitar is $500 with a standard deviation of $200. The mean 
cost for drums is $700 with a standard deviation of $100. Which cost is the lowest, 
when compared to other instruments of the same type? Which cost is the highest 
when compared to other instruments of the same type. Justify your answer. 


Solution: 


For pianos, the cost of the piano is 0.4 standard deviations BELOW the mean. For 
guitars, the cost of the guitar is 0.25 standard deviations ABOVE the mean. For 
drums, the cost of the drum set is 1.0 standard deviations BELOW the mean. Of 
the three, the drums cost the lowest in comparison to the cost of other instruments 
of the same type. The guitar costs the most in comparison to the cost of other 
instruments of the same type. 


Exercise: 


Problem: 


An elementary school class ran one mile with a mean of 11 minutes and a standard 
deviation of three minutes. Rachel, a student in the class, ran one mile in eight 
minutes. A junior high school class ran one mile with a mean of nine minutes and 
a standard deviation of two minutes. Kenji, a student in the class, ran 1 mile in 8.5 
minutes. A high school class ran one mile with a mean of seven minutes and a 
standard deviation of four minutes. Nedda, a student in the class, ran one mile in 
eight minutes. 


a. Why is Kenji considered a better runner than Nedda, even though Nedda ran 
faster than he? 
b. Who is the fastest runner with respect to his or her class? Explain why. 


Exercise: 
Problem: 


The most obese countries in the world have obesity rates that range from 11.4% to 
74.6%. This data is summarized in Table 14. 


Percent of Population Obese    Number of Countries
11.4-20.45                     29
20.45-29.45                    13
29.45-38.45
38.45-47.45
47.45-56.45
56.45-65.45
65.45-74.45
74.45-83.45


What is the best estimate of the average obesity percentage for these countries?
What is the standard deviation for the listed obesity rates? The United States has
an average obesity rate of 33.9%. Is this rate above average or below? How
“unusual” is the United States’ obesity rate compared to the average rate? Explain.


Solution:


• x̄ = 23.32


• Using the TI-83/84, we obtain a standard deviation of s_x = 12.95.

• The obesity rate of the United States is 10.58 percentage points higher than
the average obesity rate.

• Since the standard deviation is 12.95, we see that 23.32 + 12.95 = 36.27 is the
obesity percentage that is one standard deviation above the mean. The United
States obesity rate is slightly less than one standard deviation above the mean.
Therefore, we can conclude that the United States, while 34% obese, does not
have an unusually high percentage of obese people.


Exercise: 


Problem: 


[link] gives the percent of children under five considered to be underweight. 


Percent of Underweight Children    Number of Countries
16-21.45                           23
21.45-26.9                         4
26.9-32.35                         9
32.35-37.8                         7
37.8-43.25                         6
43.25-48.7                         1


What is the best estimate for the mean percentage of underweight children? What 
is the standard deviation? Which interval(s) could be considered unusual? Explain. 
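
One way to estimate a mean and standard deviation from grouped data like this is to treat every country in an interval as if it sat at the interval midpoint. The sketch below applies that idea to the table above. The variable names are illustrative, and dividing by n - 1 (treating the countries as a sample) is an assumption rather than something the exercise states.

```python
from math import sqrt

# (lower bound, upper bound, number of countries) from the table.
classes = [
    (16.0, 21.45, 23),
    (21.45, 26.9, 4),
    (26.9, 32.35, 9),
    (32.35, 37.8, 7),
    (37.8, 43.25, 6),
    (43.25, 48.7, 1),
]

freqs = [f for _, _, f in classes]
midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]
n = sum(freqs)

# Grouped mean: frequency-weighted average of the interval midpoints.
grouped_mean = sum(m * f for m, f in zip(midpoints, freqs)) / n

# Grouped sample variance: weighted squared deviations divided by n - 1.
grouped_var = sum(
    f * (m - grouped_mean) ** 2 for m, f in zip(midpoints, freqs)
) / (n - 1)
grouped_sd = sqrt(grouped_var)

print(f"n = {n}")
print(f"estimated mean = {grouped_mean:.2f}")
print(f"estimated standard deviation = {grouped_sd:.2f}")
```

The results are estimates only, since the midpoints stand in for the unknown individual values.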

Bringing It Together


Exercise: 


Problem: 


Twenty-five randomly selected students were asked the number of movies they 
watched the previous week. The results are as follows: 


# of movies    Frequency
0              5
1              9
2              6
3              4
4              1


a. Find the sample mean x̄.
b. Find the approximate sample standard deviation, s.


Solution: 


a. 1.48
b. 1.12
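
These two answers can be reproduced by expanding the frequency table into the 25 individual responses. This is a minimal sketch using Python's standard `statistics` module. The row for three movies (frequency 4) is an assumption inferred from the stated sample size of 25, and the variable names are illustrative.

```python
from statistics import mean, stdev

# Number of movies -> number of students.
# The entry 3: 4 is inferred from the sample size of 25 (an assumption).
freq = {0: 5, 1: 9, 2: 6, 3: 4, 4: 1}

# Expand the table into one entry per student.
data = [x for x, f in freq.items() for _ in range(f)]

x_bar = mean(data)   # sample mean
s = stdev(data)      # sample standard deviation (divides by n - 1)

print(f"n = {len(data)}")
print(f"sample mean = {x_bar:.2f}")
print(f"sample standard deviation = {s:.2f}")
```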


Exercise: 
Problem: 


Forty randomly selected students were asked the number of pairs of sneakers they 
owned. Let X = the number of pairs of sneakers owned. The results are as follows: 


X    Frequency
1    2
2    5
3    8
4    12
5    12
6    0
7    1


a. Find the sample mean x̄.
b. Find the sample standard deviation, s.
c. Construct a histogram of the data.
d. Complete the columns of the chart.
e. Find the first quartile.
f. Find the median.
g. Find the third quartile.
h. Construct a box plot of the data.
i. What percent of the students owned at least five pairs?
j. Find the 40th percentile.
k. Find the 90th percentile.
l. Construct a line graph of the data.
m. Construct a stemplot of the data.


Exercise: 


Problem: 


Following are the published weights (in pounds) of all of the team members of the 
San Francisco 49ers from a previous year. 


177; 205; 210; 210; 232; 205; 185; 185; 178; 210; 206; 212; 184; 174; 185; 242;
188; 212; 215; 247; 241; 223; 220; 260; 245; 259; 278; 270; 280; 295; 275; 285;
290; 272; 273; 280; 285; 286; 200; 215; 185; 230; 250; 241; 190; 260; 250; 302;
265; 290; 276; 228; 265


a. Organize the data from smallest to largest value. 

b. Find the median. 

c. Find the first quartile. 

d. Find the third quartile. 

e. Construct a box plot of the data. 

f. The middle 50% of the weights are from _______ to _______.

g. If our population were all professional football players, would the above data 
be a sample of weights or the population of weights? Why? 

h. If our population included every team member who ever played for the San 
Francisco 49ers, would the above data be a sample of weights or the 
population of weights? Why? 

i. Assume the population was the San Francisco 49ers. Find: 


i. the population mean, μ.
ii. the population standard deviation, σ.
iii. the weight that is two standard deviations below the mean. 
iv. When Steve Young, quarterback, played football, he weighed 205 
pounds. How many standard deviations above or below the mean was 
he? 


j. That same year, the mean weight for the Dallas Cowboys was 240.08 pounds
with a standard deviation of 44.38 pounds. Emmitt Smith weighed in at 209
pounds. With respect to his team, who was lighter, Smith or Young? How did
you determine your answer?


Solution: 


a. 174; 177; 178; 184; 185; 185; 185; 185; 188; 190; 200; 205; 205; 206; 210;
210; 210; 212; 212; 215; 215; 220; 223; 228; 230; 232; 241; 241; 242; 245;
247; 250; 250; 259; 260; 260; 265; 265; 270; 272; 273; 275; 276; 278; 280;
280; 285; 285; 286; 290; 290; 295; 302

b. 241

c. 205.5

d. 272.5

e. Box plot with five-number summary: 174; 205.5; 241; 272.5; 302

f. 205.5, 272.5
g. sample
h. population

i.
   i. 236.34
   ii. 37.50
   iii. 161.34
   iv. 0.84 std. dev. below the mean


j. Young 
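
Parts i and j can be checked numerically. The sketch below is illustrative rather than the textbook's method. It treats the 53 listed weights as the whole population, so it uses `pstdev`, which divides by N. The Cowboys' mean and standard deviation are taken from the problem statement, and the variable names are assumptions.

```python
from statistics import fmean, pstdev

# The 53 published weights (in pounds) listed in the problem.
weights = [
    177, 205, 210, 210, 232, 205, 185, 185, 178, 210, 206, 212, 184, 174,
    185, 242, 188, 212, 215, 247, 241, 223, 220, 260, 245, 259, 278, 270,
    280, 295, 275, 285, 290, 272, 273, 280, 285, 286, 200, 215, 185, 230,
    250, 241, 190, 260, 250, 302, 265, 290, 276, 228, 265,
]

mu = fmean(weights)        # population mean
sigma = pstdev(weights)    # population standard deviation (divides by N)
two_below = mu - 2 * sigma

# z-scores: standard deviations from each player's own team mean.
z_young = (205 - mu) / sigma
z_smith = (209 - 240.08) / 44.38   # Cowboys' mean and SD from the problem

print(f"mu = {mu:.2f}, sigma = {sigma:.2f}")
print(f"two standard deviations below the mean: {two_below:.2f}")
print(f"Young: z = {z_young:.2f}; Smith: z = {z_smith:.2f}")
print("lighter relative to his team:", "Young" if z_young < z_smith else "Smith")
```

Young's z-score is further below zero than Smith's, which is why Young is the lighter player relative to his own team even though he weighed only four pounds less.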


Exercise: 


Problem: 


One hundred teachers attended a seminar on mathematical problem solving. The 
attitudes of a representative sample of 12 of the teachers were measured before 
and after the seminar. A positive number for change in attitude indicates that a 
teacher's attitude toward math became more positive. The 12 change scores are as 
follows: 


3; 8; -1; 2; 0; 5; -3; 1; -1; 6; 5; -2


a. What is the mean change score? 

b. What is the standard deviation for this population? 

c. What is the median change score? 

d. Find the change score that is 2.2 standard deviations below the mean. 


Exercise: 


Problem: 


Refer to [link] and determine which of the following are true and which are false.
Explain your solution to each part in complete sentences. 


(a) (b) (c) 


a. The medians for all three graphs are the same. 

b. We cannot determine if any of the means for the three graphs is different. 

c. The standard deviation for graph b is larger than the standard deviation for 
graph a. 

d. We cannot determine if any of the third quartiles for the three graphs is 
different. 


Solution: 


a. True 
b. True 
c. True 
d. False 


Exercise: 


Problem: 


In a recent issue of the IEEE Spectrum, 84 engineering conferences were 
announced. Four conferences lasted two days. Thirty-six lasted three days. 
Eighteen lasted four days. Nineteen lasted five days. Four lasted six days. One 
lasted seven days. One lasted eight days. One lasted nine days. Let X = the length 
(in days) of an engineering conference. 


a. Organize the data in a chart. 


b. Find the median, the first quartile, and the third quartile. 

c. Find the 65th percentile.

d. Find the 10th percentile.

e. Construct a box plot of the data. 

f. The middle 50% of the conferences last from _______ days to _______ days.

g. Calculate the sample mean of days of engineering conferences. 

h. Calculate the sample standard deviation of days of engineering conferences. 

i. Find the mode. 

j. If you were planning an engineering conference, which would you choose as 
the length of the conference: mean; median; or mode? Explain why you made 
that choice. 

k. Give two reasons why you think that three to five days seem to be popular 
lengths of engineering conferences. 


Exercise: 


Problem: 


A survey of enrollment at 35 community colleges across the United States yielded 
the following figures: 


6414; 1550; 2109; 9350; 21828; 4300; 5944; 5722; 2825; 2044; 5481; 5200; 5853; 
2750; 10012; 6357; 27000; 9414; 7681; 3200; 17500; 9200; 7380; 18314; 6557; 
13713; 17768; 7493; 2771; 2861; 1263; 7285; 28165; 5080; 11622 


a. Organize the data into a chart with five intervals of equal width. Label the
two columns "Enrollment" and "Frequency."

b. Construct a histogram of the data.

c. If you were to build a new community college, which piece of information
would be more valuable: the mode or the mean?

d. Calculate the sample mean.

e. Calculate the sample standard deviation.

f. A school with an enrollment of 8000 would be how many standard deviations
away from the mean?


Solution: 


a. Enrollment Frequency 


1000-5000 10 
5000-10000 16 
10000-15000 3 
15000-20000 3 
20000-25000 1 
25000-30000 2 


b. Check student’s solution. 
c. mode 

d. 8628.74 

e. 6943.88 

f. -0.09 
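
Answers d through f can be reproduced from the raw enrollment figures. This is a minimal sketch with illustrative variable names. It uses the sample standard deviation (`stdev`, dividing by n - 1) because the 35 colleges are described as a survey.

```python
from statistics import mean, stdev

# Enrollment at the 35 surveyed community colleges, as listed in the problem.
enrollment = [
    6414, 1550, 2109, 9350, 21828, 4300, 5944, 5722, 2825, 2044, 5481, 5200,
    5853, 2750, 10012, 6357, 27000, 9414, 7681, 3200, 17500, 9200, 7380,
    18314, 6557, 13713, 17768, 7493, 2771, 2861, 1263, 7285, 28165, 5080,
    11622,
]

x_bar = mean(enrollment)   # sample mean
s = stdev(enrollment)      # sample standard deviation
z = (8000 - x_bar) / s     # standard deviations from the mean for 8000

print(f"sample mean = {x_bar:.2f}")
print(f"sample standard deviation = {s:.2f}")
print(f"z for an enrollment of 8000 = {z:.2f}")
```

A z-score this close to zero means an enrollment of 8000 is nearly typical for the colleges in the survey.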


Use the following information to answer the next two exercises. X = the number of days 
per week that 100 clients use a particular exercise facility. 


x    Frequency
0    3
1    12
2    33
3    28
4    11
5    9
6    4

Exercise: 


Problem: The 80th percentile is _____


an op 
S 


RWO UI 


Exercise: 


Problem: 


The number that is 1.5 standard deviations BELOW the mean is approximately 


a. 0.7 

b. 4.8 

c. -2.8

d. Cannot be determined 


Solution: 


a 
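
This answer can be checked directly from the frequency table, without listing all 100 clients. The sketch below uses the frequency-weighted formulas for the sample mean and sample standard deviation. The variable names are illustrative.

```python
from math import sqrt

# Days per week -> number of clients, from the table above.
freq = {0: 3, 1: 12, 2: 33, 3: 28, 4: 11, 5: 9, 6: 4}

n = sum(freq.values())

# Frequency-weighted sample mean.
x_bar = sum(x * f for x, f in freq.items()) / n

# Frequency-weighted sample standard deviation (divides by n - 1).
s = sqrt(sum(f * (x - x_bar) ** 2 for x, f in freq.items()) / (n - 1))

# The value 1.5 standard deviations below the mean.
target = x_bar - 1.5 * s

print(f"n = {n}, mean = {x_bar:.2f}, s = {s:.2f}")
print(f"1.5 standard deviations below the mean: {target:.1f}")
```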
Exercise: 

Problem: 

Suppose that a publisher conducted a survey asking adult consumers the number of
fiction paperback books they had purchased in the previous month. The results are
summarized in [link].


# of books    Freq.    Rel. Freq.
0             18
1             24
2             24
3             22
4             15
5             10
7             5
9             1


a. Are there any outliers in the data? Use an appropriate numerical test 
involving the IQR to identify outliers, if any, and clearly state your 
conclusion. 

b. If a data value is identified as an outlier, what should be done about it? 

c. Are any data values further than two standard deviations away from the
mean? In some situations, statisticians may use this criterion to identify data
values that are unusual, compared to the other data values. (Note that this
criterion is most appropriate to use for data that is mound-shaped and
symmetric, rather than for skewed data.)

d. Do parts a and c of this problem give the same answer? 

e. Examine the shape of the data. Which part, a or c, of this question gives a 
more appropriate result for this data? 

f. Based on the shape of the data, which is the most appropriate measure of
center for this data: mean, median, or mode?
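
The numerical test in part a is usually the 1.5 × IQR rule: a value is a potential outlier if it lies more than 1.5 × IQR below the first quartile or above the third quartile. The sketch below is one way to apply it, not the textbook's worked solution. It assumes the quartiles are the medians of the lower and upper halves of the sorted data, and it reads the second table row as 1 book with frequency 24. Other quartile conventions can give slightly different fences.

```python
from statistics import median

# Number of books -> frequency (second row read as 1: 24, an assumption).
freq = {0: 18, 1: 24, 2: 24, 3: 22, 4: 15, 5: 10, 7: 5, 9: 1}

data = sorted(x for x, f in freq.items() for _ in range(f))
n = len(data)
half = n // 2

# Quartiles as medians of the lower and upper halves; when n is odd,
# the overall median is left out of both halves.
lower = data[:half]
upper = data[half + 1:] if n % 2 else data[half:]
q1, q3 = median(lower), median(upper)
iqr = q3 - q1

low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = sorted({x for x in data if x < low_fence or x > high_fence})

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"fences: {low_fence} to {high_fence}")
print("potential outliers:", outliers)
```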


Glossary 


Standard Deviation 
a number that is equal to the square root of the variance and measures how far data
values are from their mean; notation: s for sample standard deviation and σ for
population standard deviation.


Variance 
mean of the squared deviations from the mean, or the square of the standard
deviation; for a set of data, a deviation can be represented as x - x̄ where x is a
value of the data and x̄ is the sample mean. The sample variance is equal to the
sum of the squares of the deviations divided by the difference of the sample size
and one.
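
Written as formulas, the two glossary entries above take the following form. The sample versions follow the definitions given here; the population versions, using N for the population size and μ for the population mean, are the standard counterparts and are included as an assumption about notation.

```latex
% Sample variance and sample standard deviation (n = sample size)
\[
  s^2 = \frac{\sum (x - \bar{x})^2}{n - 1},
  \qquad
  s = \sqrt{s^2}
\]

% Population variance and population standard deviation (N = population size)
\[
  \sigma^2 = \frac{\sum (x - \mu)^2}{N},
  \qquad
  \sigma = \sqrt{\sigma^2}
\]
```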


Descriptive Statistics 


Note: 

Descriptive Statistics 

Class Time: 

Names: 

Student Learning Outcomes 


• The student will construct a histogram and a box plot.
• The student will calculate univariate statistics.
• The student will examine the graphs to interpret what the data imply.


Collect the Data 
Record the number of pairs of shoes you own. 


1. Randomly survey 30 classmates about the number of pairs of shoes 
they own. Record their values. 


Survey Results 


2. Construct a histogram. Make five to six intervals. Sketch the graph 
using a ruler and pencil and scale the axes. 


Frequency 


Number of pairs of shoes 


3. Calculate the following values. 


a. x̄ = _______
b. s = _______


4. Are the data discrete or continuous? How do you know?

5. In complete sentences, describe the shape of the histogram.

6. Are there any potential outliers? List the value(s) that could be
outliers. Use a formula to check the end values to determine if they
are potential outliers.


Analyze the Data 
1. Determine the following values. 


a. Min = _______
b. M = _______
c. Max = _______
d. Q1 = _______
e. Q3 = _______
f. IQR = _______


2. Construct a box plot of the data.

3. What does the shape of the box plot imply about the concentration of 
data? Use complete sentences. 

4. Using the box plot, how can you determine if there are potential 
outliers? 


5. How does the standard deviation help you to determine concentration 
of the data and whether or not there are potential outliers? 

6. What does the IQR represent in this problem?

7. Show your work to find the value that is 1.5 standard deviations: 


a. above the mean. 
b. below the mean. 


